Abstract



This paper reviews various procedures for constructing an interval for an 
individual's true score given the assumption that errors of measurement are distributed as 
binomial. This paper also presents two general interval estimation procedures (i.e., 
normal approximation and endpoints conversion methods) for an individual's true scale 
score; compares the various interval estimation procedures through computer simulation 
studies by evaluating how close actual coverage probabilities are to selected nominal 
levels (i.e., .95, .68, and .50); and provides some practical guidelines for use of the interval
estimation procedures. To examine the effects of different types of scale scores, four 
non-linearly transformed scale scores are employed. The conditional confidence 
intervals using conditional standard errors of measurement are recommended over the 
traditional confidence intervals using the overall standard error of measurement, 
especially for lower nominal levels. The score confidence interval, Bayes confidence 
interval, and credibility interval tend to provide the actual coverage probabilities that are 
closest to the nominal levels, on average. Results for scale score intervals appear to favor 
the endpoints conversion method using the true-score conversions over the normal 
approximation approach. 







Interval Estimation for True Scores Under Various Scale Transformations* 

Introduction 

One of the goals of educational and psychological measurement is to estimate 
examinees' true scores. A point estimate of the true score may not be very meaningful 
without being accompanied by some measure of the errors involved in a measurement 
procedure. Standard errors of measurement (SEMs) typically are used to report the 
amount of measurement error in test scores. One very practical use of SEMs is in making 
inferences about an examinee’s true score via confidence intervals (Lord & Novick, 
1968). Traditionally, confidence intervals have been constructed using a strong 
assumption that measurement errors are normally distributed and the standard error of 
measurement is the same for all examinees (Feldt & Brennan, 1989). The traditional 
definition of SEM (i.e., same for all examinees) is sometimes called the overall SEM in 
the sense that it is an average SEM for all examinees in the population. 

A large volume of measurement literature, however, has been devoted to the 
theoretical developments and empirical justification for SEMs that differ at different 
points on the score scale (Brennan, 1996, 1998; Feldt, 1984; Feldt & Qualls, 1996; Feldt, 
Steffen, & Gupta, 1985; Lord, 1955, 1957, 1984; Mollenkopf, 1949; Qualls-Payne, 1992; 
Thorndike, 1951). As opposed to the overall SEM, the SEMs associated with 
individuals’ specific score levels are referred to as conditional SEMs. When a confidence 
interval is constructed for an examinee with a particular true score using the examinee’s 
conditional SEM, the interval is referred to here as a conditional confidence interval. 
Note that we can form a confidence interval for an isolated individual using either the 
overall SEM or conditional SEM. It has been suggested in the literature, however, that
confidence intervals be based on conditional SEMs, not on the SEMs for a test as a whole
(Feldt et al., 1985; Harvill, 1991).

* A previous version of this paper was presented at the Annual Meeting of the National Council on
Measurement in Education, Montreal, April 1999. The authors thank Chiou-Yueh Shyu and Matthew
Schulz for their helpful comments on the paper.

The SEMs and confidence intervals can be stated in terms of both raw scores (i.e., 
number-correct scores) and transformed scale scores. In recent years, some procedures 
have been developed for estimating conditional SEMs for scale scores (Brennan & Lee, 
1997, 1999; Feldt & Qualls, 1998; Kolen, Hanson, & Brennan, 1992; Kolen & Wang, 
1998; Kolen, Zeng, & Hanson, 1996; Wang, Kolen, & Harris, 2000). These procedures 
could be readily used to construct a conditional confidence interval for an individual’s 
true scale score. The conditional confidence intervals for raw scores or scale scores using 
conditional SEMs have not been considered extensively. 

Bayesian inference also provides a means of constructing an interval for an 
individual. The resultant intervals are often called credibility intervals (Novick & 
Jackson, 1974, pp. 119-126). A credibility interval for an examinee gives information 
about the distribution of the examinee’s true score (i.e., posterior distribution), given 
one’s prior knowledge (i.e., prior distribution) and the observed score. As discussed 
later, credibility intervals differ from confidence intervals in several ways. These two 
general approaches are compared in this paper in terms of estimation accuracy. 

The present paper (1) reviews various procedures for constructing intervals for 
raw scores, (2) presents two general interval estimation procedures for scale scores, (3) 
compares the various interval estimation procedures through computer simulation studies 
by evaluating how close the actual coverage probabilities are to the nominal levels, and 
(4) provides some practical guidelines for use of the interval estimation procedures. To 
examine the effects of raw-to-scale score transformations, four different types of scale 
scores are used: developmental standard scores (DSSs), grade equivalents (GEs), 
percentile ranks (PRs), and stanines (STs), which are all non-linear transformations of 
raw scores. Developmental standard scores are the primary score scale that is reported to 
test users for the Iowa Tests of Basic Skills (ITBS) (Hoover, Hieronymus, Frisbie, &
Dunbar, 1993a). Petersen, Kolen, and Hoover (1989) describe these four types of scale
scores in some detail. 

Note that some interval estimation procedures discussed in this paper were 
developed only for the binomial parameter. Thus, to establish comparability across 
procedures, the binomial error model is considered to be an underlying distribution of 
errors for all procedures. Accordingly, the simulation is based on the binomial error 
model as well. 



Intervals for Raw Scores 

This paper considers six different interval procedures for raw scores: (1) 
conditional confidence intervals using conditional SEMs, (2) traditional overall 
confidence intervals using the overall SEM, (3) score confidence intervals, (4) Bayes 
confidence intervals, (5) Clopper-Pearson exact confidence intervals, and (6) credibility 
intervals. The first four procedures, in effect, are based on normal distribution 
assumptions in one way or another, while the Clopper-Pearson exact confidence interval 
uses the binomial distribution, which is the "exact" distribution for the observed scores 
for a person. Credibility intervals often use a beta distribution to describe the posterior 
distribution. The binomial error model and some issues related to confidence intervals 
for the binomial parameter are discussed first followed by the overviews of the interval 
procedures. 

Confidence Intervals for a Binomial Parameter 

Let $X$ denote a random variable for an examinee's observed number-correct score.
Further, let $\tau$ be the true score for an examinee, which is defined as the expected value
of the observed scores obtained from repeated measurements. Under the binomial error
model, the conditional distribution of observed score $X$ given an individual's
proportion-correct true score, $\pi = \tau/k$, on a test consisting of $k$ dichotomously-scored
items is the binomial distribution (Lord & Novick, 1968):

$$\Pr(X = x \mid \pi) = \binom{k}{x}\,\pi^{x}(1-\pi)^{k-x}. \tag{1}$$

That is, $X$ is binomially distributed with a mean of $k\pi = \tau$ and a standard deviation of
$\sqrt{k\pi(1-\pi)}$.

Let $\bar{X} = X/k$ be a random variable for the observed proportion-correct score, and
consider the problem of determining a confidence interval for $\pi$. From the Central Limit
Theorem, $(\bar{X}-\pi)/\sqrt{\pi(1-\pi)/k}$ has a limiting standard normal distribution, $N(0,1)$.
Using theorems on limiting distributions (Hogg & Craig, 1995, pp. 253-255), it can be
shown that $(\bar{X}-\pi)/\sqrt{\bar{X}(1-\bar{X})/k}$ has a limiting distribution of $N(0,1)$ as well. Thus,
we have

$$\Pr\left[-z_\gamma < \frac{\bar{X}-\pi}{\sqrt{\bar{X}(1-\bar{X})/k}} < z_\gamma\right] = \gamma, \tag{2}$$

where $\gamma$ is a probability value, and $z_\gamma$ denotes the $(1+\gamma)/2$th quantile of the standard
normal distribution. For example, for $\gamma = .50$, $z_\gamma = .6745$; for $\gamma = .68$, $z_\gamma = 1.0$; and for
$\gamma = .95$, $z_\gamma = 1.96$. From Equation 2, it is immediate that

$$\Pr\left[\bar{X} - z_\gamma\hat{\sigma}_{e(\bar{X})} < \pi < \bar{X} + z_\gamma\hat{\sigma}_{e(\bar{X})}\right] = \gamma, \tag{3}$$

where $\hat{\sigma}_{e(\bar{X})} = \sqrt{\bar{X}(1-\bar{X})/k}$ is the estimated standard error for $\bar{X}$, and the subscript "e"
represents the error of measurement.

Equation 3 gives an interval for $\pi$, $(\bar{X} - z_\gamma\hat{\sigma}_{e(\bar{X})},\ \bar{X} + z_\gamma\hat{\sigma}_{e(\bar{X})})$, which has two
endpoints that are random variables, each of which is dependent upon $\bar{X}$; they will be
denoted as $\bar{X}_L$ and $\bar{X}_U$. So, it can be said that, prior to data collection, the probability
that the random interval $(\bar{X}_L, \bar{X}_U)$ includes the unknown parameter $\pi$ is $\gamma$. Suppose we
have collected data; then the two endpoints are known and the particular realized interval
$(\bar{x}_L, \bar{x}_U)$ either does or does not cover $\pi$. However, if many such intervals were
constructed over repeated applications of an interval estimation procedure, about
$(100\gamma)\%$ of them would cover the parameter (Feldt & Brennan, 1989). The obtained
interval $I(\pi) = (\bar{x}_L, \bar{x}_U)$ is called a $(100\gamma)\%$ confidence interval for $\pi$, and $\gamma$ is the
confidence coefficient. In the statistics literature, the interval
$[\bar{x} - z_\gamma\hat{\sigma}_{e(\bar{X})},\ \bar{x} + z_\gamma\hat{\sigma}_{e(\bar{X})}]$
is often called the Wald confidence interval for $\pi$ because it is derived from the Wald
test for $\pi$.

The fact that $(\bar{X}-\pi)/\sqrt{\pi(1-\pi)/k}$ has a $N(0,1)$ limiting distribution implies
that $X$ is distributed approximately as $N[k\pi, k\pi(1-\pi)]$ as $k$ goes to infinity (Hogg &
Tanis, 1993). One can obtain a confidence interval for $\tau$, $I(\tau)$, by multiplying the two
endpoints of $I(\pi)$ by $k$, since $\tau = \pi k$. As a result, $I(\tau)$ has the form
$(x - z_\gamma\hat{\sigma}_{e(X)},\ x + z_\gamma\hat{\sigma}_{e(X)})$, which is referred to here as the Wald confidence interval for $\tau$,
where $\hat{\sigma}_{e(X)} = \sqrt{k\bar{x}(1-\bar{x})} = \sqrt{x(k-x)/k}$ is the standard error for $X$. The confidence
intervals expressed in two different metrics (i.e., total score vs. mean score) should not be
confused.
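The Wald interval in the total-score metric can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `z_quantile` and `wald_interval` and the example values (17 items correct out of 34) are assumptions introduced here, not part of the original study.

```python
from math import sqrt
from statistics import NormalDist


def z_quantile(gamma):
    """Return the (1 + gamma)/2 quantile of the standard normal distribution."""
    return NormalDist().inv_cdf((1 + gamma) / 2)


def wald_interval(x, k, gamma):
    """Wald interval for the true raw score tau, given raw score x on k items."""
    z = z_quantile(gamma)
    se = sqrt(x * (k - x) / k)  # standard error of X in the total-score metric
    return x - z * se, x + z * se


# Hypothetical examinee: 17 items correct out of 34, nominal level .95.
lower, upper = wald_interval(x=17, k=34, gamma=0.95)
print(f"Wald interval for tau: ({lower:.2f}, {upper:.2f})")
```

Dividing both endpoints by k gives the corresponding interval in the proportion-correct metric.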

Conditional Confidence Intervals Using Conditional SEMs 

The rationale for conditional confidence intervals to be discussed here clearly
parallels the rationale for the Wald confidence intervals discussed previously. The only
difference is that conditional confidence intervals use an unbiased estimate of SEMs,
which is called Lord's SEM in the measurement literature. Under the binomial error
model, Lord (1955, 1957) provided an estimated raw-score SEM for an examinee with an
observed score of $x$:

$$\hat{\sigma}_{e(X)c} = \sqrt{\frac{x(k-x)}{k-1}} = \sqrt{\frac{k}{k-1}}\,\sqrt{\frac{x(k-x)}{k}}, \tag{4}$$

where $\sqrt{k/(k-1)}$ is a bias-correction factor to remove the bias in the variability of the
sample. Then, the approximate confidence interval for the examinee has exactly the
same form as the Wald confidence interval except that $\hat{\sigma}_{e(X)c}$ is used:

$$I_c(\tau) = \left(x - z_\gamma\hat{\sigma}_{e(X)c},\ x + z_\gamma\hat{\sigma}_{e(X)c}\right). \tag{5}$$

It is well known that the normal approximation to the binomial distribution works
best when $k$ is large and $\pi$ is close to .5, and many authors have suggested rules of
thumb (see Leemis & Trivedi, 1996) for appropriate use of the normal approximation.
For example, Hogg and Tanis (1993) considered $k$ sufficiently large if $k\pi > 5$ and
$k(1-\pi) > 5$, or a $k$ of at least 30 in all cases. Note also that $\hat{\sigma}_{e(X)c}$ would be zero for
examinees with zero or perfect scores. Thus, it is very likely that the actual coverage
probabilities for those examinees will be lower than the nominal coverage level.
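To make the point about zero and perfect scores concrete, the following Python sketch computes Lord's SEM, the conditional confidence interval, and the exact coverage probability of that interval under the binomial error model. The helper `coverage` and the choice to count an interval as covering only when the true score lies strictly between the endpoints are assumptions made for illustration.

```python
from math import comb, sqrt
from statistics import NormalDist


def lord_sem(x, k):
    """Lord's estimated raw-score SEM for an examinee with raw score x."""
    return sqrt(x * (k - x) / (k - 1))


def conditional_interval(x, k, gamma):
    """Conditional confidence interval for tau using Lord's SEM."""
    z = NormalDist().inv_cdf((1 + gamma) / 2)
    sem = lord_sem(x, k)
    return x - z * sem, x + z * sem


def coverage(tau, k, gamma):
    """Exact probability, under the binomial model, that the interval covers tau."""
    pi = tau / k
    total = 0.0
    for x in range(k + 1):
        lower, upper = conditional_interval(x, k, gamma)
        if lower < tau < upper:  # strict coverage is an assumption of this sketch
            total += comb(k, x) * pi**x * (1 - pi) ** (k - x)
    return total


# Hypothetical 34-item test: compare a central and an extreme true score.
for tau in (17, 1):
    print(tau, round(coverage(tau, k=34, gamma=0.95), 3))
```

For a true score near zero, the degenerate interval at x = 0 cannot cover the true score, which is what pulls the actual coverage below the nominal level.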

Traditional Confidence Intervals Using Overall SEM 

It has been customary to construct a $(100\gamma)\%$ confidence interval for $\tau$ using the
normal approximation in conjunction with a strong assumption of the same SEM for all
examinees. The traditional overall confidence interval is constructed as

$$I_o(\tau) = \left(x - z_\gamma\hat{\sigma}_{e(X)o},\ x + z_\gamma\hat{\sigma}_{e(X)o}\right), \tag{6}$$

where $\hat{\sigma}_{e(X)o}$ is the estimated overall SEM. The estimated overall raw-score SEM can be
obtained by $\hat{\sigma}_{e(X)o} = \sqrt{\sum \hat{\sigma}^2_{e(X)c}/N}$, which is the square root of the average of Lord's
error variances for all $N$ examinees in the sample (Brennan, 1996; Brennan & Kane,
1977). [In the terminology of generalizability theory, this is a $\Delta$-type error variance.]
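A minimal Python sketch of the overall SEM and the resulting interval follows. The sample of raw scores is invented purely for illustration, and the function names are assumptions of this sketch.

```python
from math import sqrt
from statistics import NormalDist


def overall_sem(scores, k):
    """Square root of the mean of Lord's error variances over all examinees."""
    error_variances = [x * (k - x) / (k - 1) for x in scores]
    return sqrt(sum(error_variances) / len(error_variances))


def overall_interval(x, sem_o, gamma):
    """Traditional interval: the same SEM is applied at every raw score."""
    z = NormalDist().inv_cdf((1 + gamma) / 2)
    return x - z * sem_o, x + z * sem_o


# Hypothetical sample of raw scores on a 34-item test (not data from the paper).
sample = [5, 10, 17, 25, 30, 17, 22, 12]
sem_o = overall_sem(sample, k=34)
for x in (5, 17, 30):
    print(x, overall_interval(x, sem_o, gamma=0.68))
```

Because a single SEM is used, every examinee receives an interval of the same length, whatever the raw score.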
Score Confidence Intervals

The score confidence interval was first discussed by Wilson (1927), and some
studies have recommended it over other confidence intervals for $\pi$ (Ghosh, 1979;
Agresti & Coull, 1998; Santner, 1998). The score confidence interval uses the population
standard error of $\bar{X}$, rather than the estimator in Equations 2 and 3. The two endpoints
of a score confidence interval are obtained from the fact that $(\bar{X}-\pi)/\sqrt{\pi(1-\pi)/k}$ has a
$N(0,1)$ limiting distribution. Then, a probability statement similar to Equation 2 can be
made:

$$\Pr\left[-z_\gamma < \frac{\bar{X}-\pi}{\sqrt{\pi(1-\pi)/k}} < z_\gamma\right] = \gamma. \tag{7}$$



Unlike Equations 2 and 3, Equation 7 is not directly solvable for $\pi$, but the
solution is not very complicated. Equation 7 is equivalent to

$$\Pr\left[\frac{(\bar{X}-\pi)^2}{\pi(1-\pi)/k} < z_\gamma^2\right] = \gamma. \tag{8}$$



The term in brackets in Equation 8 can be written as a quadratic equation for $\pi$.
The two zeros of the quadratic equation in $\pi$ form the endpoints of the score confidence
interval for $\pi$, $I_s(\pi)$. Now, the score confidence interval for $\tau$ is $k$ times the two
endpoints of $I_s(\pi)$, which has the form

$$I_s(\tau) = k\left(\frac{x + z_\gamma^2/2}{k + z_\gamma^2} - z_\gamma\sqrt{\frac{x(k-x)/k + z_\gamma^2/4}{(k + z_\gamma^2)^2}},\ \ \frac{x + z_\gamma^2/2}{k + z_\gamma^2} + z_\gamma\sqrt{\frac{x(k-x)/k + z_\gamma^2/4}{(k + z_\gamma^2)^2}}\right). \tag{9}$$



The midpoint of $I_s(\tau)$ can be rewritten as $x[k/(k + z_\gamma^2)] + (k/2)[z_\gamma^2/(k + z_\gamma^2)]$, which
obviously falls between $x$ and $k/2$. This midpoint, in effect, shifts the midpoint of the
conditional and traditional confidence intervals, $x$, toward $k/2$. In addition, the
multiplier of $z_\gamma$ in Equation 9 shows that the problem of zero standard errors for
examinees with $x = 0$ or $x = k$ is not present for this interval. The term under the square root
will always result in a positive number regardless of the value of $x$.
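The score interval in Equation 9 can be sketched in Python as follows. The function name `score_interval` and the example raw scores are assumptions for illustration; the arithmetic follows the usual closed-form solution of the quadratic in the proportion-correct true score.

```python
from math import sqrt
from statistics import NormalDist


def score_interval(x, k, gamma):
    """Score (Wilson) interval for the true raw score tau."""
    z = NormalDist().inv_cdf((1 + gamma) / 2)
    z2 = z * z
    center = (x + z2 / 2) / (k + z2)                    # midpoint in the pi metric
    half = z * sqrt(x * (k - x) / k + z2 / 4) / (k + z2)
    return k * (center - half), k * (center + half)


# Hypothetical 34-item test: the interval stays non-degenerate at x = 0 and x = k.
for x in (0, 5, 17, 34):
    lower, upper = score_interval(x, k=34, gamma=0.95)
    print(x, round(lower, 2), round(upper, 2), round((lower + upper) / 2, 2))
```

The printed midpoints illustrate the shift from x toward k/2 described above.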

Bayes Confidence Intervals 

The normal approximation may not be very accurate when an examinee's
observed proportion-correct score is near zero or one. An alternative estimate of $\pi$,
rather than $\bar{x}$, would be a Bayes estimate $\tilde{\pi} = (x+\alpha)/(k+\alpha+\beta)$, which appears to give
a more reasonable estimate than $\bar{x}$, especially for the extreme values of $\bar{X}$ (Chen, 1990).
The value of $\tilde{\pi}$ is the mean of the posterior distribution using the beta prior distribution
with parameters $\alpha$ and $\beta$. Chen (1990) recommended $\alpha = \beta = z_\gamma^2/2$, which shrinks the
individual's observed proportion-correct score toward .5. The endpoints of this interval can be
obtained by replacing $\bar{x}$ in the Wald confidence interval with $\tilde{\pi}$:

$$I_b(\tau) = k\left(\tilde{\pi} - z_\gamma\sqrt{\frac{\tilde{\pi}(1-\tilde{\pi})}{k}},\ \ \tilde{\pi} + z_\gamma\sqrt{\frac{\tilde{\pi}(1-\tilde{\pi})}{k}}\right). \tag{10}$$



Note that the midpoint of the Bayes interval with $\alpha = \beta = z_\gamma^2/2$ is the same as
that of the score confidence interval, since $\tilde{\pi} = (x+\alpha)/(k+\alpha+\beta) = (x+z_\gamma^2/2)/(k+z_\gamma^2)$.
As for the score confidence interval, the adjusted standard error for $\bar{x}$, $\sqrt{\tilde{\pi}(1-\tilde{\pi})/k}$,
prevents the estimate from being zero for examinees with zero or perfect observed scores.
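A short Python sketch of the Bayes confidence interval with Chen's recommended prior parameters is given below. The function name `bayes_interval` is an assumption, and the endpoints are left unclamped, so the lower endpoint can fall below zero for very low scores.

```python
from math import sqrt
from statistics import NormalDist


def bayes_interval(x, k, gamma):
    """Bayes confidence interval for tau with alpha = beta = z^2 / 2."""
    z = NormalDist().inv_cdf((1 + gamma) / 2)
    alpha = beta = z * z / 2
    pi_tilde = (x + alpha) / (k + alpha + beta)   # posterior-mean estimate of pi
    se = sqrt(pi_tilde * (1 - pi_tilde) / k)      # adjusted standard error
    return k * (pi_tilde - z * se), k * (pi_tilde + z * se)


# Hypothetical 34-item test, nominal level .68.
for x in (0, 10, 17, 34):
    print(x, bayes_interval(x, k=34, gamma=0.68))
```

With these prior parameters the midpoint is k(x + z²/2)/(k + z²), the same quantity that centers the score confidence interval.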

Clopper-Pearson Exact Confidence Intervals 

The confidence intervals discussed so far are based on limiting distribution theory
using $N(0,1)$. There are a few confidence intervals for the binomial parameter $\pi$, which
use the binomial distribution (i.e., "exact" distribution for $X$) rather than the approximate
normal distribution (Blyth & Still, 1983; Clopper & Pearson, 1934; Crow, 1956; Sterne,
1954). All exact confidence intervals are considered to be conservative in that they are
guaranteed to have the coverage probability of at least $\gamma$ for any examinee with a
particular true score (Santner, 1998). The conservatism of exact confidence intervals is
due to the fact that the binomial distribution of $X$ is discrete, and thus an exact
probability, say, .95, cannot be attained (Agresti & Coull, 1998; Hogg & Craig, 1995).

Among several methods for constructing exact intervals, the Clopper-Pearson
(1934) interval is the first and probably most widely known. The endpoints of the
Clopper-Pearson confidence interval for $\tau$ are obtained as follows: $x_L$ is $k$ times the
value of $\pi$ such that $\Pr(X \ge x \mid \pi, k) = (1-\gamma)/2$ and $x_U$ is $k$ times the value of $\pi$ such
that $\Pr(X \le x \mid \pi, k) = (1-\gamma)/2$, where

$$\Pr(X \ge x \mid \pi, k) = \sum_{i=x}^{k}\binom{k}{i}\pi^{i}(1-\pi)^{k-i}, \tag{11}$$

and

$$\Pr(X \le x \mid \pi, k) = \sum_{i=0}^{x}\binom{k}{i}\pi^{i}(1-\pi)^{k-i}. \tag{12}$$

The lower bound for $\pi$ is taken to be 0 when $x = 0$, and the upper bound for $\pi$ is taken to be 1 when
$x = k$.

A simple method of solving Equations 11 and 12 involves using either the
incomplete beta distribution or the F-distribution. Let $IB_\pi(\alpha,\beta)$ be the incomplete beta
distribution for $\pi$ with parameters $\alpha$ and $\beta$. Then, $x_L$ is $k$ times the value of $\pi$ for
which $IB_\pi(x, k-x+1) = (1-\gamma)/2$, and $x_U$ is $k$ times the value of $\pi$ for which
$IB_\pi(x+1, k-x) = (1+\gamma)/2$. Using the F-distribution, the Clopper-Pearson interval is

$$I_e(\tau) = k\left(\left\{1 + [v_1/v_2]\,F_{v_1,v_2,(1+\gamma)/2}\right\}^{-1},\ \ \left\{1 + [v_3/v_4]\,F_{v_3,v_4,(1-\gamma)/2}\right\}^{-1}\right), \tag{13}$$

where $v_1 = 2(k-x+1)$, $v_2 = 2x$, $v_3 = 2(k-x)$, and $v_4 = 2(x+1)$ are degrees of freedom,
and $F_{\cdot,\cdot,(1\pm\gamma)/2}$ denotes the $(1\pm\gamma)/2$th quantile of the F-distribution.

The Clopper-Pearson confidence interval is considered to be appropriate even for 
a small k. However, the actual coverage probability of this interval can be much larger 
than the nominal confidence level due to its conservatism. 
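As an alternative to incomplete beta or F-distribution tables, the two tail equations can be solved numerically. The Python sketch below uses plain bisection on the binomial tail probabilities; the bisection routine, the tolerance, and all function names are assumptions of this sketch rather than the procedure used in the paper.

```python
from math import comb


def upper_tail(x, k, pi):
    """Pr(X >= x | pi, k) under the binomial error model."""
    return sum(comb(k, i) * pi**i * (1 - pi) ** (k - i) for i in range(x, k + 1))


def lower_tail(x, k, pi):
    """Pr(X <= x | pi, k) under the binomial error model."""
    return sum(comb(k, i) * pi**i * (1 - pi) ** (k - i) for i in range(0, x + 1))


def solve_increasing(f, target, tol=1e-12):
    """Bisection on [0, 1] for a function f that increases with pi."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, k, gamma):
    """Clopper-Pearson interval for tau (endpoints for pi multiplied by k)."""
    tail = (1 - gamma) / 2
    # Pr(X >= x) increases with pi; Pr(X <= x) decreases, so its negative is used.
    pi_lower = 0.0 if x == 0 else solve_increasing(lambda p: upper_tail(x, k, p), tail)
    pi_upper = 1.0 if x == k else solve_increasing(lambda p: -lower_tail(x, k, p), -tail)
    return k * pi_lower, k * pi_upper


# Hypothetical examinee: 10 items correct out of 34, nominal level .95.
print(clopper_pearson(x=10, k=34, gamma=0.95))
```

The special cases x = 0 and x = k follow the boundary convention stated above.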

Credibility Intervals 

The statistical method of inference underlying credibility intervals is Bayesian 
statistics. The Bayesian estimation approach takes into account both the test score and 
any prior knowledge about the examinee's true score. Equation 1 enables us to compute 
the probabilities of various observed scores for a known value of $\pi$. Conversely, the
Bayesian approach considers the problem of inferring the value of $\pi$ given $X = x$. In
Bayesian statistics, all the information for making inferences about an examinee's true
score is contained in the conditional distribution of the examinee's true
(proportion-correct) scores given the observed score. Note that we now consider $\pi$ as a possible
value of the random variable $\Pi$ rather than a constant for an examinee. Presumably, $\Pi$
is a continuous variable with an interval of $0 < \Pi < 1$.

The conditional distribution of $\Pi$ given $X = x$, $g(\pi \mid x)$, is called the posterior
distribution. Let $f(x \mid \pi)$ denote the conditional probability density function of $X$, given
$\Pi = \pi$. The goal of Bayesian inference is to obtain the posterior distribution,
$g(\pi \mid x)$, using a subjectively selected prior distribution, $h(\pi)$, and $f(x \mid \pi)$. A
$(100\gamma)\%$ credibility interval is constructed by taking the $(1-\gamma)/2$ and $(1+\gamma)/2$
quantiles of the posterior distribution. An interval constructed in this way is sometimes
referred to as a central or an equal-tailed credibility interval (Novick & Jackson, 1974).

According to Bayes' theorem,

$$g(\pi \mid x) \propto f(x \mid \pi)\,h(\pi), \tag{14}$$
where the symbol $\propto$ is read "proportional to". The conditional observed score
distribution, $f(x \mid \pi)$, is already known as the binomial model, and the beta distribution,
$B(\alpha,\beta)$, is typically used as the prior density. The beta distribution has two parameters
$\alpha$ and $\beta$, and by varying the two parameter values, one can obtain a family of beta
densities whose functional form is very similar to that of the binomial distribution. Due
to their similar density forms, the two distributions combine in a very convenient way.
Using $B(\alpha,\beta)$ as the prior density, $h(\pi) \propto \pi^{\alpha-1}(1-\pi)^{\beta-1}$ and $f(x \mid \pi) \propto \pi^{x}(1-\pi)^{k-x}$.
Thus,

$$g(\pi \mid x) \propto \pi^{x+\alpha-1}(1-\pi)^{k-x+\beta-1}. \tag{15}$$



It is obvious from Equation 15 that the posterior distribution has the form of another beta
distribution with parameters $x+\alpha$ and $k-x+\beta$.

The endpoints of a $(100\gamma)\%$ credibility interval for $\pi$ are computed using the
incomplete beta distribution with parameters $x+\alpha$ and $k-x+\beta$. That is, $\bar{x}_L$ is the
value of $\pi$ such that $IB_\pi(x+\alpha, k-x+\beta) = (1-\gamma)/2$, and $\bar{x}_U$ is the value of $\pi$ such that
$IB_\pi(x+\alpha, k-x+\beta) = (1+\gamma)/2$. A $(100\gamma)\%$ credibility interval for $\tau$ is obtained by
multiplying $k$ by the two endpoints.

When the beta prior has parameters $\alpha = \beta = 1$, it is a uniform distribution on the
interval [0,1], implying that all values of $\Pi$ are equally likely. This particular prior
distribution is often called a non-informative prior. For the non-informative prior, the
posterior distribution is $B(x+1, k-x+1)$. In this paper, the credibility interval for $\tau$
using the non-informative prior, $I_p(\tau)$, is compared with the confidence intervals
previously described.
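For the uniform prior, the posterior quantiles can be found without a special-function library. The Python sketch below relies on a standard identity for integer parameters: the beta cumulative distribution function with parameters x + 1 and k - x + 1, evaluated at p, equals the probability that a binomial variable with k + 1 trials and success probability p is at least x + 1. The bisection search and the function names are assumptions of this sketch; for other prior parameters a general incomplete beta routine would be needed.

```python
from math import comb


def posterior_cdf_uniform_prior(p, x, k):
    """CDF at p of the Beta(x + 1, k - x + 1) posterior (uniform prior)."""
    n = k + 1
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(x + 1, n + 1))


def credibility_interval(x, k, gamma, tol=1e-12):
    """Equal-tailed credibility interval for tau under the uniform prior."""

    def quantile(q):
        lo, hi = 0.0, 1.0
        while hi - lo > tol:          # bisection; the CDF increases with p
            mid = (lo + hi) / 2
            if posterior_cdf_uniform_prior(mid, x, k) < q:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    return k * quantile((1 - gamma) / 2), k * quantile((1 + gamma) / 2)


# Hypothetical 34-item test, nominal level .68.
for x in (0, 17, 34):
    print(x, credibility_interval(x, k=34, gamma=0.68))
```

Because the interval is equal-tailed, its lower endpoint is strictly above zero even when x = 0, and its upper endpoint is strictly below k even when x = k.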

It is important to recognize that credibility intervals differ from confidence
intervals in terms of logical interpretation. In a $(100\gamma)\%$ credibility interval, $(100\gamma)\%$ is
the probability attached to the particular interval obtained for an examinee (Novick &
Jackson, 1974). As such, we can make a probability statement that the probability of an
examinee's true score falling between the two endpoints of a credibility interval is
$(100\gamma)\%$. It is the direct use of the distribution of $\Pi$ (i.e., posterior distribution) rather
than the observed score distribution that makes possible the probability statement. By
contrast, for a $(100\gamma)\%$ confidence interval, the probability attaches to the interval
estimation method, not to the particular realized interval (Novick & Jackson, 1974). A
confidence interval is constructed based on the observed score distribution, and the
particular interval either does or does not contain the true score. As mentioned earlier, a
confidence interval is typically interpreted as follows: if the interval estimation method
were applied an infinitely large number of times, $(100\gamma)\%$ of the resulting intervals
would cover the true score.



Intervals for Scale Scores 

In most testing programs, raw scores typically are transformed to scale scores for 
the purposes of reporting and making decisions about examinees. Thus, if intervals are to 
be reported, they would be most informative if expressed in terms of scale scores. Two 
general procedures are considered in this paper for constructing intervals for scale scores. 
The first, the normal approximation method, might be used for scale scores in 
conjunction with conditional scale-score SEMs or overall scale-score SEMs. The second, 
the endpoints conversion method, provides scale-score counterparts of any raw-score 
interval by converting the lower and upper endpoints of an interval for $\tau$ to
corresponding scale scores according to the functional relationship between raw scores 
and scale scores. 

Normal Approximation 

Let $S$ be a random variable for scale scores that are transformed from raw scores,
$X$, using the transformation function, $u(X)$. Let $\xi$ and $\sigma_{e(S)}$ denote the true scale score
and the scale-score SEM for an examinee, respectively. Here, $\xi$ is defined as a mean of
scale scores obtained over repeated measurements, that is

$$\xi = E[u(X)] = \sum_{i=0}^{k} u(i)\Pr(X = i \mid \pi), \tag{16}$$

where $E$ is the expectation operator and $\Pr(X = i \mid \pi)$ is given in Equation 1.

Suppose $u(X)$ is a linear transformation function, $u(X) = AX + B$. Then, the
shape of the conditional distributions for $X$ and $S$ will be the same, which makes it
sensible to use the normal approximation for scale scores to the extent that it is sensible
for raw scores. More specifically, if the limiting distribution of $X$ is $N[k\pi, k\pi(1-\pi)]$, the
limiting distribution of $S$ is $N[Ak\pi + B, A^2k\pi(1-\pi)]$. For a linear transformation, $\sigma_{e(S)}$
is simply $A\sigma_{e(X)}$. Using Lord's SEM (i.e., Equation 4) as the raw-score SEM, the
conditional confidence interval for $\xi$ for an examinee with an observed scale score $s$ is

$$I_c(\xi) = \left(s - z_\gamma\hat{\sigma}_{e(S)c},\ s + z_\gamma\hat{\sigma}_{e(S)c}\right), \tag{17}$$

where $\hat{\sigma}_{e(S)c} = A\sqrt{x(k-x)/(k-1)}$. The overall confidence interval for $\xi$, under the
linear transformation, can also be constructed as

$$I_o(\xi) = \left(s - z_\gamma\hat{\sigma}_{e(S)o},\ s + z_\gamma\hat{\sigma}_{e(S)o}\right), \tag{18}$$

where $\hat{\sigma}_{e(S)o} = A\hat{\sigma}_{e(X)o}$, with the same value of $\hat{\sigma}_{e(S)o}$ for all examinees.

Note that the normal approximation will result in exactly the same coverage
probabilities for the linearly transformed scale scores and corresponding raw scores. For
a linear transformation, $\xi = E[u(X)] = E[AX + B] = AE(X) + B = u[E(X)] = u(\tau)$,
which clearly indicates that the transformation parameters $A$ and $B$ are the same for both observed
and true scores. If and only if $x_L < \tau < x_U$, then $Ax_L + B < A\tau + B < Ax_U + B$, provided
that $A$ is a positive value. Consequently, the coverage probabilities for $I(\xi)$ and $I(\tau)$
will be the same under a linear transformation; the same argument applies to both
conditional and overall confidence intervals. For non-linear transformations, however,
the coverage probabilities for $I(\xi)$ and $I(\tau)$ will not be identical, because
$\xi = E[u(X)] \ne u[E(X)]$, in general. In other words, the relationship between $X$ and $S$ is
not the same as the relationship between $\tau$ and $\xi$ for a non-linear transformation.

A bigger concern about the normal approximation approach for the non-linearly 
transformed scale scores is that the assumption of the limiting normal distribution may 
not hold, because the non-linearity distorts the shape of the conditional distribution for S. 
Even though the normal approximation might work reasonably well for "moderate" non- 
linear transformations, it may not be appropriate for "severe" non-linear transformations. 
This paper considers four different types of non-linearly transformed scale scores, each of 
which has a different degree of non-linearity; and evaluates the performance of the 
normal approximation when applied to the various scale scores. 

For a non-linear transformation, $\sigma_{e(S)}$ is not simply $A\sigma_{e(X)}$ because the slope
parameter $A$ changes along the score scale. There exist several procedures for estimating
conditional scale-score SEMs (CSSEMs) when $u(X)$ is non-linear. In this paper, a
method called the binomial procedure (Brennan & Lee, 1997, 1999) is employed. The
binomial procedure provides $\hat{\sigma}_{e(S)c}$, which can be viewed as a scale-score analogue of
Lord's SEM:

$$\hat{\sigma}_{e(S)c} = \sqrt{\sum_{i=0}^{k}[u(i)]^2\Pr(X = i \mid \pi = \bar{x}) - \left[\sum_{i=0}^{k}u(i)\Pr(X = i \mid \pi = \bar{x})\right]^2}. \tag{19}$$



The conditional and overall confidence intervals for non-linearly transformed scale
scores, respectively, can be constructed by Equations 17 and 18 using $\hat{\sigma}_{e(S)c}$ in Equation 19.
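The true scale score in Equation 16 and the conditional scale-score SEM in Equation 19 can be sketched in Python as below. The conversion tables are hypothetical stand-ins for an operational raw-to-scale table, the function names are assumptions, and Equation 19 is read here as the binomial standard deviation of u(X) evaluated at the observed proportion-correct score, with no further correction factor applied.

```python
from math import comb, sqrt


def binom_pmf(i, k, pi):
    """Pr(X = i | pi) from the binomial error model (Equation 1)."""
    return comb(k, i) * pi**i * (1 - pi) ** (k - i)


def true_scale_score(u, pi):
    """Expected scale score for proportion-correct true score pi (Equation 16)."""
    k = len(u) - 1
    return sum(u[i] * binom_pmf(i, k, pi) for i in range(k + 1))


def scale_sem(u, x):
    """Conditional scale-score SEM at raw score x (Equation 19 as read here)."""
    k = len(u) - 1
    pi = x / k
    first = true_scale_score(u, pi)
    second = sum(u[i] ** 2 * binom_pmf(i, k, pi) for i in range(k + 1))
    return sqrt(max(second - first**2, 0.0))  # guard against tiny negative rounding


# Hypothetical conversion tables for a 10-item test (index = raw score).
linear = [2 * i + 100 for i in range(11)]
curved = [i * i for i in range(11)]
print(scale_sem(linear, 5), scale_sem(curved, 5))
print(true_scale_score(curved, 0.5), curved[5])
```

The last line shows the non-linear case in which the expected scale score differs from the scale score at the expected raw score.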
Endpoints Conversion

As discussed in the previous section, one problem with the normal approximation
method of constructing confidence intervals for non-linearly transformed scale scores
arises due to the direct use of the conditional scale-score distributions for which the 
normality assumption seemingly does not hold. Another approach to constructing 
intervals for scale scores presented in this section is free from such a problem because it 
does not assume any distributional form of the scale scores. The method here called 
"endpoints conversion" finds the endpoints of scale-score intervals by converting the 
endpoints of raw-score intervals through a functional relationship between raw and scale 
scores. In effect, the scale-score counterpart of any raw-score interval can be obtained by 
the endpoints conversion method. 

There seem to be at least two functional relationships that can be used for
converting the endpoints. Obviously, the actual observed score transformation, $u(X)$,
could be used, with which the endpoints for the scale-score counterpart of a raw-score
interval are obtained as

$$I_u(\xi) = \left(u[x_L],\ u[x_U]\right), \tag{20}$$

where $x_L$ and $x_U$ are the two endpoints of the raw-score interval. However, there is a
complexity. To use a preexisting conversion table, $u(X)$, it must be assumed that the
transformation is a continuous function, because $x_L$ and $x_U$ are often non-integer values
and the corresponding scale scores cannot be read directly from the conversion table.
Thus, an interpolation procedure is usually needed.

Another alternative is to use the relationship between true scores and true scale
scores. Let $v$ denote the transformation function from $\tau$ to $\xi$ such that $\xi = v(\tau)$. Once
the true-score conversion function, $v$, is determined, the two endpoints of a raw-score
interval, $x_L$ and $x_U$, are substituted for $\tau$. For the case considered in this paper, $v$ is
defined by Equations 1 and 16, and the two endpoints of the scale-score counterpart can
be obtained by substituting $x_L/k$ and $x_U/k$ (if $x_L$ and $x_U$ are in the total score metric)
for $\pi$ in both equations. Let us express the resultant scale-score interval as:

$$I_v(\xi) = \left(v[x_L],\ v[x_U]\right). \tag{21}$$

Note that the notation of $I_u(\xi)$ and $I_v(\xi)$ indicates that they are generic intervals.
That is, the endpoints of the intervals can be obtained from any raw-score endpoints, and
different raw-score endpoints will result in different endpoints for $I_u(\xi)$ and $I_v(\xi)$.
When the true-score conversion, $v$, is used, the coverage probability of the scale-score
counterpart, $I_v(\xi)$, will be equal to that of the raw-score counterpart regardless of
whether the raw-to-scale score transformation is linear or non-linear, because the
endpoints and true scores are converted through the same conversion function, $v$. By
contrast, the coverage probability of $I_u(\xi)$ will not be the same as that of the raw-score
counterpart, because the endpoints and true scores are converted through different
conversion functions, $u$ and $v$. In general, the true score conversion approach seems more
reasonable. It is consistent with the fact that the endpoints are in the metric of true
scores. The endpoints are almost always non-integer values and are compared with the
true score to determine the coverage. Moreover, the same coverage probability for raw-
and scale-score intervals seems appealing in practice.
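The two endpoint conversions can be contrasted with a short Python sketch. The conversion table is hypothetical, linear interpolation is only one possible way to make the observed-score conversion continuous, and clamping the endpoints to the raw-score range is an assumption of this sketch.

```python
from math import comb


def u_interp(u, t):
    """Observed-score conversion u, linearly interpolated between table entries."""
    k = len(u) - 1
    t = min(max(t, 0.0), float(k))      # clamp to the raw-score range (assumption)
    i = min(int(t), k - 1)
    return u[i] + (t - i) * (u[i + 1] - u[i])


def v(u, t):
    """True-score conversion: expected scale score when pi = t / k (Equations 1 and 16)."""
    k = len(u) - 1
    pi = min(max(t / k, 0.0), 1.0)      # clamp to [0, 1] (assumption)
    return sum(u[i] * comb(k, i) * pi**i * (1 - pi) ** (k - i) for i in range(k + 1))


def convert_interval(u, x_lower, x_upper, conversion):
    """Scale-score counterpart of a raw-score interval under a given conversion."""
    return conversion(u, x_lower), conversion(u, x_upper)


# Hypothetical non-linear table for a 10-item test and a raw-score interval.
table = [i * i for i in range(11)]
raw_interval = (3.4, 6.6)
print(convert_interval(table, *raw_interval, u_interp))
print(convert_interval(table, *raw_interval, v))
```

For a linear table the two conversions return the same endpoints; they diverge as the table becomes more strongly non-linear.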

The two approaches will produce very similar results, however, for a nearly
one-to-one transformation. Figure 1 depicts the two conversions for the four types of scale
scores for ITBS Vocabulary ($k = 34$), Form K, Level 10. The observed-score
conversions are the ones that are used operationally for the test, and the true-score
conversions were computed using Equation 16. Note that Equation 16 will result in zero
values for $\xi$ when $\tau$ is either zero or $k$. Hence, the maximum and minimum values of $\xi$
were set to equal the maximum and minimum values of the scale scores in the
observed-score conversions. The label "Raw Score" for the horizontal axis in Figure 1 should be
interpreted as either the true score or observed score depending upon what type of
conversion is under consideration. Notice that the two conversions are extremely similar
for DSSs, GEs, and PRs, largely because they are one-to-one functions throughout most
of the score ranges. The largest difference is found in the raw-to-ST conversion, where
many observed raw score points are converted to the same stanine point. The true-score
conversions are strictly increasing functions (i.e., one-to-one at any score point), and
appear to be smoother than the observed-score conversions, in general. This paper
considers the true-score conversions only. Some results based on the observed-score
conversions are discussed by Lee (1998).

Numerical Example 

The interval estimation procedures discussed in the previous sections are
illustrated using the same test with the conversion table shown in Figure 1. Note that the
confidence intervals using the overall SEMs are not considered in this example because
they require actual examinee data. Table 1 displays actual endpoints of the nominal 68%
raw and DSS intervals at five different score points: $x = 5$ (DSS = 141);
$x = 10$ (DSS = 165); $x = 17$ (DSS = 187); $x = 25$ (DSS = 207); and $x = 30$ (DSS = 229).
The shaded areas in Table 1 indicate the endpoints of DSS intervals obtained through the
endpoints conversion method with the true-score conversions, $I_v(\xi)$. The lower panel of
Table 1 shows the actual raw-to-DSS conversion for the test, which is used to calculate
the DSS intervals (Hoover, Hieronymus, Frisbie, & Dunbar, 1993b). Readers can verify
the results reported in Table 1 using this conversion table.

The endpoints of (t) and (t) tend to be closer to each other than to any other 
intervals for the nominal level of 68%. Moreover, it can be verified that the midpoints of 
I_s(τ) and I_b(τ) are always the same and shifted toward k/2 from the midpoint of 
I_c(τ), which equals x. That is, the midpoint of I_s(τ) and I_b(τ) is larger than x when 
x < k/2 and smaller than x when x > k/2. The midpoint shift toward k/2 is also 
observed for I_e(τ) and I_p(τ). At x = k/2, which is 17, the midpoints of all five 
raw-score intervals are equal to 17. Note that all procedures produce different endpoints, 
although rounding may cause some endpoints to appear equal. 

For the DSS intervals, only I_c(ξ) has midpoints that are equal to the observed 
DSS scores. The midpoint of I_v(ξ) converted from I_c(τ) does not necessarily equal the 
observed DSS score because of the non-linearity of the raw-to-DSS transformation. 
Likewise, the midpoints of the DSS counterparts of I_s(τ) and I_b(τ) are not necessarily 
the same although their raw-score counterparts have the same midpoints. Another 
important property of the various interval estimation methods is the lengths of the 
intervals. The lengths of each interval across the score scales are plotted in Figure 2. 

First notice in Figure 2 that the patterns of the various interval lengths are very 
similar. Actually, the shapes of the interval lengths reflect the shapes of the conditional 
SEMs (presented later in Figure 8): large (small) conditional SEMs lead to wide (narrow) 
intervals. The irregular pattern of the DSS interval lengths is due to the non-linear 
character of the raw-to-DSS transformation. Notice also that the lengths of I_e(τ) for 
both the raw and DSS scores are markedly larger than those of the other intervals throughout the 
score range, which, as discussed later, is closely related to the fact that the coverage 
probabilities of I_e(τ) are exceptionally large. The lengths of the intervals other than 
I_e(τ) do not appear to be very different except at both ends of the score scales. 
In particular, I_c(ξ) and I_v(ξ) converted from I_c(τ) exhibit very similar lengths of the 
DSS intervals even though the endpoints of the two intervals are not very close to each 
other (see Table 1). Note that the lengths of I_v(ξ) converted from I_c(τ) and of I_c(ξ) are 
zero for the zero and perfect raw and corresponding DSS scores, which is caused by the 
zero estimated conditional SEMs. The approximate ascending order of the raw-score 
interval lengths, mainly in the middle of the score scale, is I_p(τ), I_s(τ), I_b(τ), I_c(τ), 
and I_e(τ). The same ordering applies to the corresponding DSS intervals. All other 
things being equal (such as the same coverage probability), a method with narrow 
intervals would be preferred. 



Simulation Study 

Since all procedures discussed in this paper are associated with the binomial 
distribution of errors in one way or another, a simulation was conducted based on a 
model called the beta-binomial model (Keats & Lord, 1962; Lord & Novick, 1968), 
which assumes that errors are distributed binomially. The beta-binomial model is known 
to fit many observed score distributions very well. In order to generate random data that 
are as realistic as possible, a real test data set initially was used for specifying the 
simulation conditions. This simulation study used data from Level 10, Form K of the 
Vocabulary subtest (k = 34) of the ITBS; a random sample of 3000 examinees at grade 4 
(Level 10) was selected from the 1992 Spring standardization sample. 

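Under the beta-binomial model, the conditional distribution of X given π is 
binomial, and π is distributed as beta with parameters α and β. Let 

\rho_{21} = \frac{k}{k-1}\left[1 - \frac{\hat{\mu}(k - \hat{\mu})}{k\hat{\sigma}^{2}}\right] 

denote KR21 reliability, where μ̂ and σ̂ are the 
mean and standard deviation of the test scores for the 3000 examinees. The parameters 
α and β were estimated using the following formulas (see Huynh, 1976; Jarjoura, 
1985): 

\hat{\alpha} = \hat{\mu}\left(\frac{1}{\rho_{21}} - 1\right) \quad\text{and}\quad \hat{\beta} = \frac{k}{\rho_{21}} - k - \hat{\alpha}. \qquad (22) 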
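The moment-matching step in Equation 22 can be sketched in a few lines. The function name and the round-trip check below are illustrative assumptions rather than the authors' code; the check relies on the standard mean and variance of a beta-binomial distribution to confirm that the formulas recover the parameters that generated them.

```python
def estimate_beta_params(mean, sd, k):
    """Estimate beta-binomial parameters from the score mean and SD via KR21."""
    kr21 = (k / (k - 1)) * (1 - mean * (k - mean) / (k * sd ** 2))
    alpha = mean * (1 / kr21 - 1)
    beta = k / kr21 - k - alpha
    return alpha, beta, kr21


# Round-trip check using the estimates reported in the text (3.4, 1.9, k = 34).
k, a, b = 34, 3.4, 1.9
mean = k * a / (a + b)                                        # beta-binomial mean
var = k * a * b * (a + b + k) / ((a + b) ** 2 * (a + b + 1))  # beta-binomial variance
alpha_hat, beta_hat, kr21 = estimate_beta_params(mean, var ** 0.5, k)
print(alpha_hat, beta_hat, kr21)
```

Under the beta-binomial model KR21 reduces to k / (k + α + β), which is why the round trip is exact up to floating-point error.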



Equation 22 yielded α̂ = 3.4 and β̂ = 1.9, which suggests that the distribution of 
π is a bit negatively skewed. Negatively skewed distributions of test scores are typical 
for many standardized achievement tests. Treating the two parameter estimates as 
population parameters, true proportion-correct scores, π, were generated for 1000 
simulees from β(3.4, 1.9). In generating random beta deviates, the acceptance-rejection 
method was employed as described in Mooney (1997, pp. 25-30). For each simulee, the 
conditional observed score distribution, Pr(X = x | π), was computed using Equation 1, 
and the true number-correct score and true scale score were computed as τ = kπ and 
ξ = Σ u(i) Pr(X = i | π), summing over i = 0, ..., k. Then, the following steps were executed: 

1. A set of k = 34 random 0/1 item responses for each of the 1000 simulees was 
generated by comparing a uniform random deviate, r, with π 34 times. If r < π, 
then a score of one was assigned to the item; otherwise, a score of zero was 
assigned. 

2. All interval estimation procedures were applied to the simulated data, and 
intervals were constructed for each of the 1000 simulees. 

3. For each simulee, it was determined whether each interval contained the simulee's 
true (scale) score. If an interval covered the parameter, Q = 1; otherwise, 
Q = 0. 

4. The above steps were replicated R = 1000 times, and the sum of Q over the R 
replications was calculated for each simulee, which represents the empirical number of 
times that the intervals obtained from repeated measurements include the true (scale) 
score. The actual coverage probability was computed for each simulee, each interval 
estimation procedure, and each type of scale score. 
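The four steps above can be sketched for a single interval procedure as follows. This is a minimal illustration under stated assumptions: it uses only the normal-approximation interval x ± z(SEM) with the estimated conditional SEM taken to be sqrt(x(k − x)/(k − 1)), it draws beta deviates with the standard-library sampler rather than the acceptance-rejection method used in the study, and it uses 200 simulees and 200 replications (instead of 1000 each) to keep the run short. All function names are hypothetical.

```python
import math
import random


def lord_sem_interval(x, k, z):
    """Normal-approximation interval x +/- z * SEM with SEM = sqrt(x(k - x)/(k - 1))."""
    sem = math.sqrt(x * (k - x) / (k - 1))
    return x - z * sem, x + z * sem


def simulate_coverage(pi, k, z, reps, rng):
    """Empirical coverage of the true score tau = k * pi for one simulee."""
    tau = k * pi
    hits = 0
    for _ in range(reps):
        # Step 1: k item responses, each scored 1 when a uniform deviate is below pi.
        x = sum(1 for _ in range(k) if rng.random() < pi)
        # Step 2: construct the interval from the observed score.
        lo, hi = lord_sem_interval(x, k, z)
        # Step 3: record whether the interval covers the true score.
        if lo <= tau <= hi:
            hits += 1
    # Step 4: proportion of replications in which the interval covered tau.
    return hits / reps


rng = random.Random(2024)
k = 34
simulees = [rng.betavariate(3.4, 1.9) for _ in range(200)]
coverage = [simulate_coverage(pi, k, 1.96, 200, rng) for pi in simulees]
mean_cov = sum(coverage) / len(coverage)
print(round(mean_cov, 3))
```

Other interval procedures can be compared by swapping in a different interval function at Step 2 while keeping the remaining steps unchanged.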

The simulation procedure was replicated for three different nominal confidence 
levels: 95%, 68%, and 50%. The actual coverage probabilities obtained through the 
simulation were compared to these three nominal levels. These three nominal levels were 
used in the previous study by Jarjoura (1985). 

Finally, note that the number of items in the original test was 34. To examine the 
effect of the number of items in the test, the whole simulation was repeated for k = 17. 
In all other respects, the population characteristics were exactly the same. A shorter 
version of the conversion table was created somewhat arbitrarily by removing the even- 
numbered rows in the original conversion table. Consequently, the patterns of the 
transformations for the shorter test were remarkably similar to those of the original test. 
Figures 3 and 4 show the plots of the transformations for the two tests. 

Results 

Nominal 95% Intervals 

Table 2 contains averages and standard deviations of actual coverage probabilities 
for the nominal 95% intervals. The averages and standard deviations were computed 
based on 1000 simulees' actual coverage probabilities, and the averages are, in fact, the 
actual coverage probabilities over 1,000,000 intervals (1000 simulees times 1000 
replications). 

For the raw-score intervals, the score confidence interval, I_s(τ), and the 
credibility interval with the non-informative prior, I_p(τ), appear to show 
actual coverage probabilities close to the nominal level of .95 with relatively small standard 
deviations. The Bayes confidence interval, I_b(τ), and the Clopper-Pearson exact 
confidence interval, I_e(τ), tend to be somewhat conservative (i.e., larger actual coverage 
probabilities than the nominal level). Recall that I_e(τ) is supposed to have coverage 
probabilities that are always bounded below by the nominal confidence level. The 
somewhat conservative coverage probabilities of I_e(τ) are consistent with the previous 
results reported by Agresti and Coull (1998). The conditional confidence interval using 
Lord's SEM, I_c(τ), yielded an actual coverage probability that is too small and has the 
largest standard deviation, which is likely due to zero estimated SEMs for x = 0 or k. 
With a zero SEM, the width of I_c(τ) is zero and the actual coverage probability can be 
too low. Apparently, the overall confidence interval using the overall SEM, I_o(τ), seems to 
perform better than I_c(τ). It is not necessarily true, however, that a procedure showing a 
better overall coverage probability is more accurate across all levels of the score scale. 
Some procedures might be more accurate than other procedures near the middle of the 
score scale but less accurate at extremes. More discussion about this issue is presented 
later. 

The shaded areas in Table 2 and all the subsequent tables represent the coverage 
probabilities for the scale-score counterparts of each of the six raw-score intervals, I_v(ξ), 
obtained by the endpoints conversion method with the true-score conversions. Note that 
the coverage probabilities for I_v(ξ) are exactly the same as those for the corresponding 
raw-score intervals regardless of the types of scale scores. The last two columns of Table 2 
are for the conditional and overall scale-score confidence intervals, I_c(ξ) and I_o(ξ). 
Clearly, I_o(ξ) provides better actual coverage probabilities and smaller standard 
deviations than I_c(ξ). As for the conditional raw-score confidence intervals, zero 
estimated CSSEMs at both ends of the score scales are a major problem with I_c(ξ). The 
results suggest that I_v(ξ) associated with "good" raw-score intervals such as the score 
and Bayes confidence intervals would be preferable to I_c(ξ) and I_o(ξ) for the nominal 
95% intervals. 

The plots of actual coverage probabilities for raw-score intervals are shown in 
Figure 5. Each dot represents the coverage probability for a single examinee. Notice that 
the actual coverage probability varies across the levels of the score scale. The coverage 
probabilities of I_o(τ) display a U-shaped trend, indicating that the actual coverage 
probability for an examinee would be either too large or too small depending upon where 
the examinee's true score is located on the continuum, except for the regions where the 
reference line crosses the function of the actual coverage probabilities. As discussed later 
(see Figure 8), this is consistent with the fact that the pattern of the conditional raw-score 
SEMs is an inverted U-shape, and thus the average SEM would be too large near both 
extremes and too small in the middle of the score scale. I_o(τ) may not be adequate for 
reporting individual-level confidence intervals in practice. 

By contrast, the coverage probabilities for I_c(τ) are close to .95 in the middle of 
the score range, and tend to decline with increases in the absolute deviation of the raw 
score from the mid-score point. These results are consistent with the conventionally 
known fact that the normal approximation for the binomial parameter works best for 
values around .5, which is, in the present case, equivalent to the true score of 17. As 
discussed earlier, the coverage probabilities of I_c(τ) drop sharply toward zero at the right end 
of the score scale. A similar drop would have been observed at 
extremely low true scores if the simulated data had contained enough data points in that 
region. Compared to I_o(τ), however, the coverage probabilities of I_c(τ) are fairly 
consistently closer to the nominal level throughout most of the score scale. 

The actual coverage probabilities of I_s(τ) and I_p(τ) tend to be reasonably well 
scattered around the reference line. Notice, however, that both I_s(τ) and I_p(τ) show a 
small drop at the right extreme. For I_p(τ), the endpoints of a 95% interval when x = 33 
and x = 34 (i.e., a perfect score) are (28.928, 33.762) and (30.599, 33.976), respectively. 
Suppose an examinee has a true score of 33.8. Whenever the examinee's observed score 
is less than perfect, the interval will not cover the examinee's true score. The upper 
endpoint of I_p(τ) when x = k - 1 is the lower bound of a range of τ values that fall in 
the interval only when x = k. The simulated data actually had three simulees with 
true scores greater than 33.762, which exactly matches the number of dots in the plot that 
are far below the nominal level at the right end of the score scale. Although the 
simulated data do not contain such cases, a similar remark can be made for τ values near 
zero. There is also a range of τ values that can be covered by the interval only when 
x = 0. The range is bounded above by the lower endpoint of the interval when x = 1, 
which is .238 for k = 34. The credibility interval has an additional problem. Note that 
the upper endpoint of I_p(τ) is 33.976 when x = 34. Thus, the coverage probability for 
an examinee with a true score greater than 33.976 will necessarily be zero. The present 
data do not have such high true score values. As discussed later, however, this actually 
happens with lower nominal confidence levels, for which the interval is 
narrower. The range of τ values for which the actual coverage probability is necessarily 
zero gets smaller as the number of items increases. 
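The perfect-score endpoints quoted above can be checked by hand. The sketch below assumes that the non-informative prior is uniform and that the credibility interval is equal-tailed; under those assumptions the posterior for a perfect score is Beta(k + 1, 1), whose quantiles have a closed form. The function name is hypothetical.

```python
def credibility_interval_perfect_score(k, level):
    """Equal-tailed posterior interval for the true score when x = k.

    Assumes a uniform prior, so the posterior of pi is Beta(k + 1, 1),
    whose cumulative distribution function is p ** (k + 1).
    """
    tail = (1 - level) / 2
    lower = k * tail ** (1 / (k + 1))
    upper = k * (1 - tail) ** (1 / (k + 1))
    return lower, upper


lower95, upper95 = credibility_interval_perfect_score(34, 0.95)
print(round(lower95, 3), round(upper95, 3))  # compare with (30.599, 33.976) in the text
```

Because the upper endpoint is strictly below k for any finite k, true scores above it can never be covered, which is the additional problem noted for the credibility interval.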

The score confidence interval has a similar, but less serious, problem. The 
endpoints of the 95% score confidence interval when x = 33 are (28.929, 33.823). Note 
that the upper endpoint is larger than the corresponding value for I_p(τ). Again, the 
upper endpoint 33.823 is the lower bound of the range of τ values that fall in the 
interval only when x = 34. There is only one simulee, as shown in the plot, who has a true 
score greater than 33.823. However, I_s(τ) does not have the problem of zero coverage 
probability for extremely high true scores as does I_p(τ), because the upper endpoint of 
I_s(τ) when x = k is always k. 
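The score-interval endpoints in this paragraph can be reproduced with the sketch below, which assumes that the score confidence interval takes the Wilson form discussed by Agresti and Coull (1998), rescaled from the proportion metric to the number-correct metric. The function name and the default z = 1.96 are illustrative choices.

```python
import math


def score_interval(x, k, z=1.96):
    """Score (Wilson-form) interval for the true number-correct score."""
    p_hat = x / k
    denom = 1 + z ** 2 / k
    center = (p_hat + z ** 2 / (2 * k)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / k + z ** 2 / (4 * k ** 2)) / denom
    return k * (center - half), k * (center + half)


lo33, hi33 = score_interval(33, 34)
lo34, hi34 = score_interval(34, 34)
print(round(lo33, 3), round(hi33, 3))  # compare with (28.929, 33.823) in the text
print(hi34)                            # upper endpoint for a perfect score
```

For x = k the center and half-width sum to one in the proportion metric, so the upper endpoint equals k (up to floating-point error) regardless of the confidence level.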

The actual coverage probabilities of I_b(τ) and I_e(τ) are almost uniformly larger 
than the nominal level along the entire score scale, with I_e(τ) being somewhat more 
conservative. Compared to I_s(τ) and I_p(τ), the range of τ values that is covered by the 
exact confidence interval only when x = k is very small. This range does not even exist 
for I_b(τ). When x = 33, the upper endpoint of I_e(τ) is 33.975, but the upper endpoint of 
I_b(τ) is 34.426, which is greater than the maximum true score, k. The upper endpoint of 
I_e(τ) when x = k is set equal to k. The upper endpoint of I_b(τ) when x = k is allowed 
to be greater than k; the limit approaches k from above as k goes to infinity. 
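The two upper endpoints quoted for x = 33 can be examined numerically. In the sketch below, the exact (Clopper-Pearson) limit uses the closed form that applies when x = k − 1. The Bayes-type interval is an assumed form, not taken from the text of this section: a shrunken estimate (x + z²/2)/(k + z²) plus z times a binomial SEM evaluated at that estimate. It is shown only because it yields a value close to the one reported above; both function names are hypothetical.

```python
import math


def exact_upper_at_k_minus_1(k, level=0.95):
    """Clopper-Pearson upper endpoint (number-correct metric) when x = k - 1.

    The upper limit p solves Pr(X <= k - 1 | p) = 1 - p ** k = (1 - level) / 2.
    """
    tail = (1 - level) / 2
    return k * (1 - tail) ** (1 / k)


def bayes_type_upper(x, k, z=1.96):
    """Upper endpoint of an ASSUMED Bayes-type interval (illustration only)."""
    p_tilde = (x + z ** 2 / 2) / (k + z ** 2)     # estimate shrunk toward 1/2
    sem = math.sqrt(p_tilde * (1 - p_tilde) / k)  # binomial SEM at the shrunken estimate
    return k * (p_tilde + z * sem)


exact_upper = exact_upper_at_k_minus_1(34)
bayes_upper = bayes_type_upper(33, 34)
print(round(exact_upper, 3))  # compare with 33.975 in the text
print(round(bayes_upper, 3))  # compare with 34.426 in the text
```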

The actual coverage probabilities for I_c(ξ) and I_o(ξ) are plotted in Figures 6 and 
7. Notice that, as for the raw score case, the coverage probabilities for the conditional 
scale-score confidence intervals are very low near the right extreme because of zero 
estimated CSSEMs when x = k. Unlike the results for I_c(τ) and I_o(τ), however, the 
patterns of the coverage probabilities for the scale-score confidence intervals across the 
score scales tend to be irregular, largely because of the non-linearity in the 
transformations. There are at least two potential sources of inaccuracy due to the 
non-linearity, which cause the coverage probabilities for the scale-score confidence intervals 
based on the normal approximation to deviate from the nominal levels: (a) bias in the 
estimated CSSEMs and (b) violation of the normality assumption. 

The degree of bias in the estimated CSSEMs can be evaluated by comparing them 
with the true CSSEMs. Since the parameter π is known for each simulee, the true SEMs 
can be computed. The true raw-score SEM for an examinee under the binomial error 
model is \sqrt{k\pi(1-\pi)}, and the true CSSEM is 

\sqrt{\sum_{i=0}^{k} [u(i) - \xi]^{2} \Pr(X = i \mid \pi)}. \qquad (23) 



Figure 8 displays the true and mean estimated SEMs (over replications) for the 
raw and scale scores. The shape of the conditional raw-score SEMs is a concave-down 
parabola (Brennan, 1996, 1998; Feldt et al., 1985), and there does not seem to exist any 
noticeable bias in the estimated SEMs. The mean estimated overall raw-score SEM is a 
constant and an overestimate for examinees with very low and high true scores, but an 
underestimate for examinees in the middle of the true score distribution. 
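The true SEMs discussed above can be computed directly once π is known. The sketch below assumes that the true CSSEM is the conditional standard deviation of the converted scores u(X) given π under the binomial error model; the conversion passed in is hypothetical (an identity conversion is used so that the result can be checked against the raw-score SEM).

```python
import math


def binomial_pmf(k, pi):
    """Pr(X = i | pi) for i = 0..k under the binomial error model."""
    return [math.comb(k, i) * pi ** i * (1 - pi) ** (k - i) for i in range(k + 1)]


def true_scale_score_and_cssem(u, pi):
    """Return the true scale score and true CSSEM for a conversion table u.

    u is a list of k + 1 scale scores; u[i] is the conversion of raw score i.
    """
    k = len(u) - 1
    pmf = binomial_pmf(k, pi)
    xi = sum(ui * p for ui, p in zip(u, pmf))               # true scale score
    var = sum((ui - xi) ** 2 * p for ui, p in zip(u, pmf))  # conditional variance
    return xi, math.sqrt(var)


k, pi = 34, 0.6
identity = list(range(k + 1))  # hypothetical conversion u(i) = i
xi, cssem = true_scale_score_and_cssem(identity, pi)
print(xi, cssem, math.sqrt(k * pi * (1 - pi)))
```

With the identity conversion the true scale score equals kπ and the CSSEM equals the raw-score SEM; a non-linear table changes both, which is the source of the irregular CSSEM patterns.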

The CSSEMs typically are irregular depending upon the pattern of non-linear 
transformations (Brennan & Lee, 1997, 1999; Feldt & Qualls, 1998; Kolen, Hanson, & 
Brennan, 1992). The CSSEMs are larger, in general, at the score points where the slope 
is steeper. However, in many cases, the estimated CSSEMs tend to be small at both 
extremes of the score scales regardless of the degree of the slope because the conditional 
raw-score SEMs are too small at the extremes (Brennan & Lee, 1997, 1999). Lee, 
Brennan, and Kolen (1998, 2000) reported that the estimated CSSEMs tended to be 
biased, and the direction of bias was related to the magnitude of the CSSEMs along the 
score scale. As seen in Figure 8, the CSSEMs tend to be overestimated near the middle 
values of the true scale scores, at which the CSSEMs are the local minima. However, the 
CSSEMs are underestimated near the true scale scores showing the local maximum 
values of the CSSEMs. It may be noticed in Figures 6 and 7 with conditional DSS and 
GE intervals that the most accurate coverage probabilities are associated with the scale 
score values at which the pattern or slope of the transformations change (i.e., inflection 
points) shown in Figure 1. Indeed, Lee et al. (1998, 2000) found that the degree of bias 
in the estimated CSSEMs is smallest near the inflection points. Also, notice that the 
actual coverage probabilities for the PR intervals tend to be less irregular than the other 
scale score results, because the raw-to-PR transformation is nearly linear throughout most 
of the score scale. The constant mean estimated overall scale-score SEMs shown in 
Figure 8 produce bias throughout the score scales. Figure 9 shows the bias plots for the 
estimated SEMs. 

Figure 10 provides plots of the nominal 95% confidence intervals based on the 
normal approximation using the true conditional SEMs computed by Equation 23. The 
actual coverage probabilities are nearly uniformly distributed around the reference line of 
.95. Comparing Figure 10 and Figures 5, 6, and 7 along with Figure 9 provides a general 
idea about the effect of bias in the estimated SEMs on the actual coverage probabilities. 
It seems evident that the patterns of the coverage probabilities depicted in Figures 5 
(conditional and overall confidence intervals only), 6, and 7 tend to mirror, in general, the 
patterns of the bias functions shown in Figure 9. However, the actual coverage 
probabilities for the confidence intervals with conditional SEMs near very high and low 
true scores tend to be too small even though the estimated conditional SEMs do not 
exhibit any noticeably large bias. This appears to be caused by an SEM of zero when an 
observed raw score is equal to k. 

How can the actual coverage probability be too small when there is no bias in the 
mean estimated SEM? Let us take an example. Consider a simulee with a true score of 
33.831 and the true SEM of 0.410. For this simulee, a large number of confidence 
intervals will not contain the true score when x = k because the estimated SEM is zero. 
However, the mean estimated SEM is close to the true SEM, because, for this simulee, x 
is equal to 34 or 33 in most cases and the estimated SEM is either zero or 1.0 (see 
Equation 4), and thus, the average becomes close to the true value of 0.410. 
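The example can be worked through exactly rather than by simulation. The sketch below assumes that the estimated conditional SEM is sqrt(x(k − x)/(k − 1)), which gives the values of zero at x = 34 and 1.0 at x = 33 mentioned above, and it reads the "mean estimated SEM" as the square root of the mean estimated error variance. That reading is an assumption; the plain average of the estimated SEMs is computed as well for comparison.

```python
import math

k = 34
tau = 33.831
pi = tau / k


def pmf(x):
    """Binomial probability Pr(X = x | pi)."""
    return math.comb(k, x) * pi ** x * (1 - pi) ** (k - x)


def lord_sem(x):
    """Estimated conditional SEM; zero at x = k and 1.0 at x = k - 1."""
    return math.sqrt(x * (k - x) / (k - 1))


true_sem = math.sqrt(k * pi * (1 - pi))
p_perfect = pmf(k)  # probability of a zero-width interval
mean_of_sems = sum(pmf(x) * lord_sem(x) for x in range(k + 1))
root_mean_var = math.sqrt(sum(pmf(x) * lord_sem(x) ** 2 for x in range(k + 1)))
# Exact coverage of x +/- 1.96 * SEM for this simulee.
coverage_95 = sum(pmf(x) for x in range(k + 1)
                  if abs(x - tau) <= 1.96 * lord_sem(x))
print(true_sem, root_mean_var, mean_of_sems, p_perfect, coverage_95)
```

Under these assumptions the root of the mean estimated variance matches the true SEM, yet most replications produce a perfect score and hence an interval of zero width, so the coverage for this simulee stays far below the nominal level.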

The variability of the coverage probabilities in Figure 10 is indicative of the effect 
of the violation of the normality assumption. The degree of the violation seems to vary 
depending on the types of scale scores and the location of the true (scale) scores. 
Obviously, the results for ST show the largest variability. This issue is discussed in 
greater detail in the next section on the nominal 68% intervals. 

Nominal 68% Intervals 

The results for the nominal 68% intervals are summarized in Table 3. In general, 
the conditional confidence intervals for both raw and scale scores, I_c(τ) and I_c(ξ), 
outperform the overall confidence intervals, I_o(τ) and I_o(ξ), with respect to both the 
average and standard deviation of the actual coverage probabilities. In particular, I_c(τ) 
performed nearly as well as I ^ (r) and (r) . For scale scores, /„ (^) appears to provide 
better coverage probabilities than 7^ (^) and 7^ (^) , when the endpoints of 7„ (^) are 
obtained from converting the endpoints of any raw-score intervals except for 7„(r) and 

Note that the coverage probabilities for I„(^) are larger than .68 for all four types 
of scale scores. The results also show that 7^(r) works better than the others, which in 
turn, leads to the better performance of the scale-score counterparts of 7^, (r) . The actual 
coverage probabilities of 7^(r) are excessively large. In comparison with the results for 
the nominal level of .95, the standard deviations of the actual coverage probabilities for 
the 68% confidence intervals tend to be much larger. One reason might be that the 
coverage probabilities of 68% confidence intervals have more room to move up and 
down. 

The results for 68% raw-score intervals are plotted in Figure 11. Clearly, I_c(τ) 
provides the coverage probabilities that are closer to the nominal level than I_o(τ) 
throughout most of the score scale. Also notice that the patterns of the actual coverage 
probabilities for 7 ^(t) , and 7^(7) are remarkably similar, except that 7 ^(t) shows 

a zero coverage probability at a true score near 34. The endpoints of a 68% 
credibility interval when x = 34 are (32.266, 33.831). Since x = 34 is the maximum 
number of items correct, the upper endpoint of the interval cannot be greater than 
33.831. Thus, the coverage probability of the credibility interval for an examinee with 
a true score greater than 33.831 will be zero regardless of the examinee's observed 
score, and the simulated data contain one simulee with such a high true score. A similar 
remark can be made for the other end of the score range. 

Figures 12 and 13 depict the actual coverage probabilities for 68% scale-score 
confidence intervals. As noted in Table 3, I_c(ξ) shows much better patterns for the 
actual coverage probabilities than I_o(ξ) except for the ST results. The excessive 
variation in the coverage probabilities for both conditional and overall ST confidence 
intervals makes them totally unacceptable. Since the plot for the ST counterpart of, for 
example, I_s(τ) will be the same as the plot for I_s(τ) (i.e., the coverage probabilities for 
I_v(ξ) are identical to those for the corresponding raw-score intervals), the endpoints 
conversion method clearly provides better coverage probabilities for STs. 

Figure 14 contains plots for the nominal 68% confidence intervals based on the 
normal approximation using the true conditional SEMs. Since the true SEMs are used 
here, the variations in the actual coverage probabilities are mainly due to the violation of 
the normality assumption. It appears that the normality assumption does not hold very 
well for STs and PRs at both ends of the score scale. For the sake of argument, let us 
consider the PR case, and presume that the normality assumption holds fairly well across 
the entire range of the raw-score scale. As shown in Figure 3, the raw-to-PR 
transformation is almost linear along the score scale except for both extremes where the 
transformation becomes flat. Thus, the PR distribution at very high or low score points 
will be much narrower than the raw-score distribution, which, in turn, will result in large 
actual coverage probabilities as seen in Figure 14. 

Nominal 50% Intervals 

The actual coverage probabilities for the nominal 50% intervals are presented in 
Table 4. The inferences that can be made from Table 4 are pretty much the same as those 
that can be made from Table 3 for the 68% intervals. Some minor differences include 
that the results for 50% intervals show slightly larger standard deviations than those of 
the 68% intervals, in general. Also, the better performance of the conditional confidence 
intervals than the overall confidence intervals becomes more salient. In addition, the 
coverage probabilities for /^(r) now tend to exceed the nominal level to a prohibitive 
degree. 

Plots are provided in Figures 15 through 18. The coverage probabilities for I_s(τ) 
and I_b(τ) tend to become more similar as the nominal level decreases, as shown in Figure 15. 
The coverage probabilities for the nominal 50% scale-score intervals (Figures 16 and 17) 
are generally more variable than those for the higher nominal levels. In particular, the ST 
confidence intervals display overly variable coverage probabilities. The large variation in 
the coverage probabilities of the ST confidence intervals is primarily due to the many-to- 
one conversion characteristics of the raw-to-ST transformation (see Figure 3). Likewise, 
the coverage probabilities for PR intervals tend to be more variable than those for DSSs 
and GEs at both tails of the score scale. As shown in Figure 3, several raw-score points 
are converted to the same PR point at both ends of the raw-to-PR transformation. In 
addition, the coverage probabilities for the PR intervals are fairly flat, like those for the 
conditional raw-score intervals, which is associated with the approximately linear pattern 
of the raw-to-PR transformation throughout a wide range of the score scale. The actual 
coverage probabilities for the conditional 50% confidence intervals using the true 
conditional SEMs (i.e., Figure 18) show patterns that are similar to those for the 68% 
intervals, except that the 50% ST intervals and PR intervals at both ends tend to produce 
very low coverage probabilities due to narrow intervals coupled with narrow scale score 
distributions. 

Intervals for the Half-Length Test 

The whole simulation study was repeated for a test with a smaller number of 
items (k = 17), and the results for the three nominal confidence levels are summarized in 
Tables 5, 6, and 7. A more meaningful interpretation of these results might be made 
through a comparison with the results for the original test. One apparent difference is 
that, with the shorter test, the standard deviations of the actual coverage probabilities for 
all interval estimation procedures are somewhat larger than those with the full-length test. 
As expected, the conditional confidence intervals performed worse with the shorter test; 
they produced coverage probabilities that are noticeably lower or higher than the nominal 
levels. Also notice that the overall coverage probabilities for I_e(τ) for the shorter test are 
larger than those for the longer test, which were themselves larger than the nominal 
levels. In general, the other three raw-score intervals tend to work slightly worse with the 
shorter test. 

A series of figures is provided for the shorter test results: Figures 19-21 for 95%; 
Figures 22-24 for 68%; and Figures 25-27 for 50%. A few comments will suffice. 
The general patterns of the actual coverage probabilities are very similar to those for the 
full-length test. For the scale-score confidence intervals, the similarity might be due to 
the similar pattern of the transformations for the two tests as shown in Figures 3 and 4. 
The plots for the shorter test show more white space between clusters of dots, however, 
which is due to discreteness. There are only 17 + 1 = 18 possible intervals for the shorter 
test, whereas there are 35 possible intervals for the longer test. 

Discussion 

Agresti and Coull (1998) recommended score confidence intervals for nearly any 
sample size and parameter value, and the results of the present study support this 
recommendation. One minor drawback of the score confidence interval is that, as 
discussed earlier, it would have an actual coverage probability that is far below the 
nominal confidence level for an examinee with a true proportion-correct score near 
zero or one. As the number of items increases, however, the problem diminishes. On 
average, the score confidence intervals provided the actual coverage probabilities closest 
to the nominal levels regardless of the test length. 

One interesting observation is that the actual coverage probabilities for the score 
confidence intervals in Figures 5, 11, and 15 are almost identical to those for the 
conditional raw-score confidence intervals using the true SEMs (Figures 10, 14, and 18). 
This appears to be related to the fact that the true SEM is defined in this paper based on 
the binomial model, and that the derivation of the endpoints for a score confidence 
interval involves use of the true SEMs (see Equations 7 and 8). 

In general, credibility intervals with a non-informative prior worked very well. 
One conceptual advantage of the credibility intervals is that we can make a probabilistic 
statement about a particular interval. There seem to be two technical problems with the 
credibility intervals, however. One, as with the score confidence intervals, there are true 
score regions at which the actual coverage probabilities could be too low, and the regions 
tend to be slightly larger than those associated with the score confidence intervals. This 
problem diminishes as the test gets longer. Two, especially for lower nominal levels, the 
actual coverage probabilities can drop to zero for extremely high or low true scores. A 
practical solution to the second problem might be to set the upper endpoint of the interval 
equal to the maximum number-correct score for an examinee with a perfect score, and the 
lower endpoint equal to zero for an examinee with zero score. 

The results of the present study clearly showed that the Clopper-Pearson exact 
confidence intervals gave actual coverage probabilities that substantially exceeded the 
nominal confidence levels. The exact confidence intervals are useful, however, as a 
conservative procedure. That is, with these intervals we can be sure that intervals will, on 
average, have at least the desired coverage probability regardless of score levels. Of 
course, if it is desired to have coverage probabilities as close as possible to the specified 
level at all score points, then the score and credibility intervals would be preferable. 

The performance of the Bayes confidence intervals was acceptable, and it worked 
especially well with the nominal levels of .68 and .50. One advantage of the Bayes 
confidence intervals is that they do not have the problem of seriously low coverage 
probabilities. Also, the form of the Bayes intervals is identical to the familiar confidence 
interval form of x ± (z_γ)SEM, using the Bayes estimate in place of the mean observed 
score. 

Users might still insist on using intervals that involve adding and subtracting 
estimated SEMs multiplied by a z-score, since they are very popular and easy to 
implement. In such cases, it is recommended that the conditional SEMs be used rather 
than the overall SEM, especially for moderate and small confidence levels. Though the 
traditional overall confidence intervals, on some occasions, provide overall actual 
coverage probabilities closer to the nominal confidence level, the intervals using 
conditional SEMs tend to produce actual coverage probabilities that are consistently 
closer to the nominal level across a wide range of the score scale. This 
recommendation is applicable to both raw- and scale-score confidence intervals. 
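
As an illustration of the conditional interval, the sketch below computes x plus or minus z times a conditional SEM. It assumes Lord's binomial-error conditional SEM, sqrt(x(k - x)/(k - 1)), which is not restated in this section; the function names are hypothetical. With k = 34 and a nominal level of .68, the rounded endpoints agree with the first raw-score interval listed at each score point in Table 1.

```python
from math import sqrt
from statistics import NormalDist


def conditional_sem(x, k):
    """Assumed form: Lord's binomial-error conditional SEM for raw score x on k items."""
    return sqrt(x * (k - x) / (k - 1))


def conditional_interval(x, k, level):
    """Symmetric interval x +/- z * conditional SEM on the raw-score metric."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # two-sided normal quantile
    half_width = z * conditional_sem(x, k)
    return x - half_width, x + half_width


for x in (5, 10, 17, 25, 30):
    lo, hi = conditional_interval(x, 34, 0.68)
    print(x, round(lo, 1), round(hi, 1))
# prints:
# 5 2.9 7.1
# 10 7.3 12.7
# 17 14.1 19.9
# 25 22.4 27.6
# 30 28.1 31.9
```

Only the examinee's own score enters the calculation. The overall SEM, by contrast, requires group-level data and therefore is not shown. Note also that this SEM is zero at x = 0 and x = k, so the interval collapses to a point at the extremes.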

In addition, note that the computation of the conditional confidence intervals is 
based on test data for a single examinee only, whereas the traditional confidence intervals 
using the overall SEM make use of data from other examinees. Therefore, the 
conditional confidence interval might be more appropriate for a uniquely identifiable 
examinee. For instance, a counselor dealing with an individual student (especially one 
who is particularly challenged or able) may be well advised to use an interval such as 
$I_c(\tau)$, because it is based on the student's test data only. By contrast, a test publisher, not 
knowing individual examinees, may opt for reporting intervals such as $I_o(\tau)$ in test 
manuals. 

The accuracy of the actual coverage probabilities for the conditional scale-score 
confidence intervals appears to depend upon (1) the pattern of the transformation (i.e., its 
slope), (2) the accuracy of the estimated CSSEMs, and (3) the transformation type 
(one-to-one or many-to-one). The pattern of the transformation is closely related to the 
accuracy of the estimated CSSEMs. Both the pattern and the type of the transformation are 
important factors because they can distort the shape of the conditional scale-score 
distributions, so that the normality assumption may no longer hold for the scale scores. 
Given that the normal approximation works fairly well for the raw scores, the more severe 
the transformation, the more likely it is that the normality assumption is violated for the 
scale scores. 
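
One rough way to see how a transformation distorts the conditional distribution is to compare the skewness of the raw score and of the transformed score at a given true score. The sketch below does this under the binomial error model, using the raw-to-DSS conversion as read from Table 1. Skewness is used here only as a convenient indicator of departure from normality; it is not a statistic reported in this study, and the function names are hypothetical.

```python
from math import comb

# Raw-to-DSS conversion for k = 34, as read from Table 1 (index = raw score).
DSS = [124, 127, 130, 133, 137, 141, 146, 150, 155, 160, 165, 169,
       173, 176, 179, 182, 184, 187, 189, 191, 194, 196, 199, 201,
       204, 207, 210, 214, 219, 224, 229, 234, 241, 250, 261]


def conditional_pmf(tau, k):
    """Binomial probabilities of raw scores 0..k given true score tau = k * p."""
    p = tau / k
    return [comb(k, x) * p**x * (1.0 - p) ** (k - x) for x in range(k + 1)]


def skewness(values, probs):
    """Standardized third central moment of a discrete distribution."""
    mean = sum(v * q for v, q in zip(values, probs))
    var = sum((v - mean) ** 2 * q for v, q in zip(values, probs))
    third = sum((v - mean) ** 3 * q for v, q in zip(values, probs))
    return third / var**1.5


def raw_and_scale_skewness(tau, conversion):
    """Skewness of X and of u(X) given tau, for a conversion table u."""
    k = len(conversion) - 1
    probs = conditional_pmf(tau, k)
    return skewness(list(range(k + 1)), probs), skewness(conversion, probs)


for tau in (5.0, 17.0, 30.0):
    raw, scale = raw_and_scale_skewness(tau, DSS)
    print(f"tau={tau:4.1f}  raw skewness={raw:+.3f}  DSS skewness={scale:+.3f}")
```

A many-to-one conversion can be examined in the same way by passing a conversion list that contains repeated scale-score values.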

When constructing intervals for scale scores, the results presented here suggest 
that the endpoints conversion method using the true-score conversion is preferable to the 
normal approximation approach. It is recommended that the normal approximation be 
used for scale-score confidence intervals only when the transformation is approximately 
linear. One comment on the true-score conversion should be made. In order to get the v 
transformation, which converts true scores into true scale scores, we begin by obtaining 
the observed-score distribution given $\tau$. Doing so requires a model, and in the present 
case, the binomial error model was used. Any psychometric model that is assumed to hold 
for the data, such as one based on item response theory, could be used for the true-score 
conversion; whichever model is used, the actual coverage probability for the scale-score 
counterparts will always be the same as the raw-score coverage probability. 
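
The true-score conversion can be sketched as follows. The sketch assumes that the true scale score is the expected scale score given the true score, with the observed-score distribution supplied by the binomial error model; this reading of the conversion is an assumption, as are the function names. The conversion table is the raw-to-DSS conversion as read from Table 1.

```python
from math import comb

# Raw-to-DSS conversion for k = 34, as read from Table 1 (index = raw score).
DSS = [124, 127, 130, 133, 137, 141, 146, 150, 155, 160, 165, 169,
       173, 176, 179, 182, 184, 187, 189, 191, 194, 196, 199, 201,
       204, 207, 210, 214, 219, 224, 229, 234, 241, 250, 261]


def true_score_conversion(tau, conversion):
    """Expected scale score given true raw score tau under the binomial model."""
    k = len(conversion) - 1
    p = tau / k
    return sum(s * comb(k, x) * p**x * (1.0 - p) ** (k - x)
               for x, s in enumerate(conversion))


def convert_interval(lower, upper, conversion):
    """Endpoints conversion: map each raw-score endpoint to the scale-score metric."""
    return (true_score_conversion(lower, conversion),
            true_score_conversion(upper, conversion))


# Raw-score endpoints 2.9154 and 7.0846 are illustrative inputs
# (a 68% conditional interval for x = 5 under an assumed binomial conditional SEM).
print(convert_interval(2.9154, 7.0846, DSS))
```

Because the conversion is increasing in the true score whenever the scale scores increase with the raw score, a true score lies between the raw-score endpoints exactly when its true scale score lies between the converted endpoints. For the lower endpoint used above, the converted value is roughly 133.4, in line with the first DSS interval for x = 5 in Table 1.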

Although this paper employed the true-score conversion for the endpoints 
conversion method, the observed-score conversion (i.e., Equation 20) could be another 
alternative. For a nearly one-to-one transformation, the two conversion functions will be 
very similar, and the resultant scale-score endpoints will be very similar as well. 
However, when the transformation is many-to-one, such as the raw-to-ST transformation 
considered in this paper, the true-score conversion would be smoother than the 
observed-score conversion, and would provide somewhat better coverage probabilities (see 
Lee, 1998, for the results of the ST confidence intervals using the observed-score 
conversion). In general, the true-score conversion approach would be preferred because it 
is consistent with the fact that the endpoints are on the metric of true scores, and it always 
produces the same coverage probability for the raw- and scale-score intervals. Depending 
upon the type of the transformation, however, the observed-score conversion approach 
might be preferred 
because it is relatively easy to implement using a simple interpolation procedure. 

TABLE 1

Endpoints of Nominal 68% Intervals at Some Selected Observed
Raw and DSS Scores with k = 34

                  Conditional    Score          Bayes          Exact          Credibility    Conditional
                  CI             CI             CI             CI             interval       scale-score CI

x = 5, DSS = 141
  Raw             2.9, 7.1       3.3, 7.4       3.2, 7.5       2.9, 8.0       3.6, 7.7
  DSS             133.4, 150.9   134.8, 152.3   134.6, 152.6   133.4, 154.8   136.0, 153.8   132.1, 149.9

x = 10, DSS = 165
  Raw             7.3, 12.7      7.6, 12.8      7.5, 12.9      7.1, 13.3      7.8, 13.0
  DSS             152.0, 173.8   153.2, 174.2   153.0, 174.4   151.2, 175.9   154.1, 174.8   153.8, 176.2

x = 17, DSS = 187
  Raw             14.1, 19.9     14.1, 19.9     14.1, 19.9     13.7, 20.3     14.2, 19.8
  DSS             178.2, 193.8   178.5, 193.6   178.3, 193.7   177.0, 194.8   178.6, 193.5   179.4, 194.6

x = 25, DSS = 207
  Raw             22.4, 27.6     22.2, 27.3     22.2, 27.4     21.7, 27.7     22.0, 27.1
  DSS             200.2, 218.1   199.8, 216.8   199.7, 217.0   198.4, 218.7   199.3, 215.8   198.1, 215.9

x = 30, DSS = 229
  Raw             28.1, 31.9     27.8, 31.5     27.7, 31.6     27.2, 31.9     27.4, 31.2
  DSS             220.4, 241.9   218.8, 239.2   218.5, 239.6   216.3, 241.8   217.1, 237.0   218.3, 239.7

Raw-to-DSS Conversion

Raw      0    1    2    3    4    5    6    7    8    9   10   11
DSS    124  127  130  133  137  141  146  150  155  160  165  169

Raw     12   13   14   15   16   17   18   19   20   21   22   23
DSS    173  176  179  182  184  187  189  191  194  196  199  201

Raw     24   25   26   27   28   29   30   31   32   33   34
DSS    204  207  210  214  219  224  229  234  241  250  261

TABLE 2

Average Actual Coverage Probabilities of
Nominal 95% Intervals with k = 34

Scale  Stat  Cond.     Overall   Score     Bayes     Exact     Cred.     Cond. SS  Overall SS

Raw    Mean  .924      .947      .952      .967      .970      .953
       SD    .062      .028      .012      .010      .009      .014

DSS    Mean  .924      .947      .952      .967      .970      .953      .944      .949
       SD    .062      .028      .012      .010      .009      .014      .068      .032

GE     Mean  .924      .947      .952      .967      .970      .953      .944      .949
       SD    .062      .028      .012      .010      .009      .014      .070      .053

PR     Mean  .924      .947      .952      .967      .970      .953      .889      .941
       SD    .062      .028      .012      .010      .009      .014      .074      .037

ST     Mean  .924      .947      .952      .967      .970      .953      .911      .943
       SD    .062      .028      .012      .010      .009      .014      .087      .039

Note: Shaded area in the original (the first six columns of the DSS, GE, PR, and ST rows,
which repeat the raw-score values) is for the scale-score counterparts using true-score
conversions;
Cond. = conditional confidence intervals using conditional SEMs;
Overall = overall confidence intervals using overall SEMs;
Score = score confidence intervals;
Bayes = Bayes confidence intervals;
Exact = Clopper-Pearson exact confidence intervals;
Cred. = credibility intervals with non-informative prior;
Cond. SS = conditional scale-score confidence intervals using CSSEMs; and
Overall SS = overall scale-score confidence intervals using overall scale-score SEMs.

TABLE 3

Average Actual Coverage Probabilities of
Nominal 68% Intervals with k = 34

Scale  Stat  Cond.     Overall   Score     Bayes     Exact     Cred.     Cond. SS  Overall SS

Raw    Mean  .676      .693      .681      .691      [illeg.]  .686
       SD    .058      .093      .047      .047      .046      .052

DSS    Mean  .676      .693      .681      .691      [illeg.]  .686      .697      .709
       SD    .058      .093      .047      .047      .046      .052      .071      .096

GE     Mean  .676      .693      .681      .691      [illeg.]  .686      .700      .728
       SD    .058      .093      .047      .047      .046      .052      .083      .126

PR     Mean  .676      .693      .681      .691      [illeg.]  .686      .652      .695
       SD    .058      .093      .047      .047      .046      .052      .066      .144

ST     Mean  .676      .693      .681      .691      [illeg.]  .686      .670      .701
       SD    .058      .093      .047      .047      .046      .052      .153      .194

Note: Shaded area in the original (the first six columns of the DSS, GE, PR, and ST rows,
which repeat the raw-score values) is for the scale-score counterparts using true-score
conversions;
Cond. = conditional confidence intervals using conditional SEMs;
Overall = overall confidence intervals using overall SEMs;
Score = score confidence intervals;
Bayes = Bayes confidence intervals;
Exact = Clopper-Pearson exact confidence intervals;
Cred. = credibility intervals with non-informative prior;
Cond. SS = conditional scale-score confidence intervals using CSSEMs; and
Overall SS = overall scale-score confidence intervals using overall scale-score SEMs.

TABLE 4

Average Actual Coverage Probabilities of
Nominal 50% Intervals with k = 34

Scale  Stat  Cond.     Overall   Score     Bayes     Exact     Cred.     Cond. SS  Overall SS

Raw    Mean  .497      .519      .499      .503      .617      .501
       SD    .064      .114      .060      .058      .057      .063

DSS    Mean  .497      .519      .499      .503      .617      .501      .516      .526
       SD    .064      .114      .060      .058      .057      .063      .074      .102

GE     Mean  .497      .519      .499      .503      .617      .501      .525      .545
       SD    .064      .114      .060      .058      .057      .063      .091      .130

PR     Mean  .497      .519      .499      .503      .617      .501      .487      .537
       SD    .064      .114      .060      .058      .057      .063      .074      .188

ST     Mean  .497      .519      .499      .503      .617      .501      .490      .521
       SD    .064      .114      .060      .058      .057      .063      .155      .164

Note: Shaded area in the original (the first six columns of the DSS, GE, PR, and ST rows,
which repeat the raw-score values) is for the scale-score counterparts using true-score
conversions;
Cond. = conditional confidence intervals using conditional SEMs;
Overall = overall confidence intervals using overall SEMs;
Score = score confidence intervals;
Bayes = Bayes confidence intervals;
Exact = Clopper-Pearson exact confidence intervals;
Cred. = credibility intervals with non-informative prior;
Cond. SS = conditional scale-score confidence intervals using CSSEMs; and
Overall SS = overall scale-score confidence intervals using overall scale-score SEMs.

TABLE 5

Average Actual Coverage Probabilities of
Nominal 95% Intervals with k = 17

Scale  Stat  Cond.     Overall   Score     Bayes     Exact     Cred.     Cond. SS  Overall SS

Raw    Mean  .898      .947      .954      .977      .976      .954
       SD    .089      .030      .015      .010      .010      .034

DSS    Mean  .898      .947      .954      .977      .976      .954      .928      .951
       SD    .089      .030      .015      .010      .010      .034      .103      .029

GE     Mean  .898      .947      .954      .977      .976      .954      .926      .947
       SD    .089      .030      .015      .010      .010      .034      .105      .060

PR     Mean  .898      .947      .954      .977      .976      .954      .856      .940
       SD    .089      .030      .015      .010      .010      .034      .090      .043

ST     Mean  .898      .947      .954      .977      .976      .954      .887      .947
       SD    .089      .030      .015      .010      .010      .034      .104      .031

Note: Shaded area in the original (the first six columns of the DSS, GE, PR, and ST rows,
which repeat the raw-score values) is for the scale-score counterparts using true-score
conversions;
Cond. = conditional confidence intervals using conditional SEMs;
Overall = overall confidence intervals using overall SEMs;
Score = score confidence intervals;
Bayes = Bayes confidence intervals;
Exact = Clopper-Pearson exact confidence intervals;
Cred. = credibility intervals with non-informative prior;
Cond. SS = conditional scale-score confidence intervals using CSSEMs; and
Overall SS = overall scale-score confidence intervals using overall scale-score SEMs.

TABLE 6

Average Actual Coverage Probabilities of
Nominal 68% Intervals with k = 17

Scale  Stat  Cond.     Overall   Score     Bayes     Exact     Cred.     Cond. SS  Overall SS

Raw    Mean  .666      .693      .676      .697      .803      .690
       SD    .075      .115      .063      .063      .052      .062

DSS    Mean  .666      .693      .676      .697      .803      .690      .703      [illeg.]
       SD    .075      .115      .063      .063      .052      .062      .106      .104

GE     Mean  .666      .693      .676      .697      .803      .690      .709      .731
       SD    .075      .115      .063      .063      .052      .062      .108      .129

PR     Mean  .666      .693      .676      .697      .803      .690      .635      .695
       SD    .075      .115      .063      .063      .052      .062      .079      .149

ST     Mean  .666      .693      .676      .697      .803      .690      .665      .702
       SD    .075      .115      .063      .063      .052      .062      .136      .138

Note: Shaded area in the original (the first six columns of the DSS, GE, PR, and ST rows,
which repeat the raw-score values) is for the scale-score counterparts using true-score
conversions;
Cond. = conditional confidence intervals using conditional SEMs;
Overall = overall confidence intervals using overall SEMs;
Score = score confidence intervals;
Bayes = Bayes confidence intervals;
Exact = Clopper-Pearson exact confidence intervals;
Cred. = credibility intervals with non-informative prior;
Cond. SS = conditional scale-score confidence intervals using CSSEMs; and
Overall SS = overall scale-score confidence intervals using overall scale-score SEMs.

TABLE 7

Average Actual Coverage Probabilities of
Nominal 50% Intervals with k = 17

Scale  Stat  Cond.     Overall   Score     Bayes     Exact     Cred.     Cond. SS  Overall SS

Raw    Mean  .495      .513      .496      .505      .661      .510
       SD    .088      .130      .090      .086      .078      .084

DSS    Mean  .495      .513      .496      .505      .661      .510      .530      .531
       SD    .088      .130      .090      .086      .078      .084      .099      .118

GE     Mean  .495      .513      .496      .505      .661      .510      .535      .552
       SD    .088      .130      .090      .086      .078      .084      .108      .149

PR     Mean  .495      .513      .496      .505      .661      .510      .468      .526
       SD    .088      .130      .090      .086      .078      .084      .082      .192

ST     Mean  .495      .513      .496      .505      .661      .510      .484      .526
       SD    .088      .130      .090      .086      .078      .084      .178      .209

Note: Shaded area in the original (the first six columns of the DSS, GE, PR, and ST rows,
which repeat the raw-score values) is for the scale-score counterparts using true-score
conversions;
Cond. = conditional confidence intervals using conditional SEMs;
Overall = overall confidence intervals using overall SEMs;
Score = score confidence intervals;
Bayes = Bayes confidence intervals;
Exact = Clopper-Pearson exact confidence intervals;
Cred. = credibility intervals with non-informative prior;
Cond. SS = conditional scale-score confidence intervals using CSSEMs; and
Overall SS = overall scale-score confidence intervals using overall scale-score SEMs.






FIGURE 1. True-Score and Observed-Score Conversions
[Panels: Raw to PR; Raw to ST. Legend: True-Score Conversion; Observed-Score Conversion.]







FIGURE 2. Lengths of Nominal 68% Intervals for Raw and DSS Scores with k = 34

FIGURE 3. Raw to Scale-Score Transformations with k = 34

FIGURE 4. Raw to Scale-Score Transformations with k = 17

FIGURE 5. Actual Coverage Probabilities of Nominal 95%
Raw-Score Intervals with k = 34

FIGURE 6. Actual Coverage Probabilities of Nominal 95% Scale-Score Intervals
Using Conditional Scale-Score SEMs with k = 34

FIGURE 7. Actual Coverage Probabilities of Nominal 95% Scale-Score Intervals
Using Overall Scale-Score SEMs with k = 34

FIGURE 8. True and Estimated SEMs with k = 34

FIGURE 9. Bias for Estimated SEMs with k = 34

FIGURE 10. Actual Coverage Probabilities of Nominal 95% Conditional
Confidence Intervals Using True Conditional SEMs with k = 34

FIGURE 11. Actual Coverage Probabilities of Nominal 68%
Raw-Score Intervals with k = 34

FIGURE 12. Actual Coverage Probabilities of Nominal 68% Scale-Score Intervals
Using Conditional Scale-Score SEMs with k = 34

FIGURE 13. Actual Coverage Probabilities of Nominal 68% Scale-Score Intervals
Using Overall Scale-Score SEMs with k = 34

FIGURE 14. Actual Coverage Probabilities of Nominal 68% Conditional
Confidence Intervals Using True Conditional SEMs with k = 34

FIGURE 15. Actual Coverage Probabilities of Nominal 50%
Raw-Score Intervals with k = 34



[Figure 15 consists of six panels: Conditional CIs, Overall CIs, Score CIs, Bayes CIs, Exact CIs, and Credibility Intervals. Horizontal axis in each panel: True Score (0 to 35); vertical axis: actual coverage probability (0.0 to 1.0).]

FIGURE 16. Actual Coverage Probabilities of Nominal 50% Scale-Score Intervals
Using Conditional Scale-Score SEMs with k = 34

FIGURE 17. Actual Coverage Probabilities of Nominal 50% Scale-Score Intervals
Using Overall Scale-Score SEMs with k = 34

FIGURE 18. Actual Coverage Probabilities of Nominal 50% Conditional
Confidence Intervals Using True Conditional SEMs with k = 34

FIGURE 19. Actual Coverage Probabilities of Nominal 95%
Raw-Score Intervals with k = 17

FIGURE 20. Actual Coverage Probabilities of Nominal 95% Scale-Score Intervals
Using Conditional Scale-Score SEMs with k = 17

FIGURE 21. Actual Coverage Probabilities of Nominal 95% Scale-Score Intervals
Using Overall Scale-Score SEMs with k = 17

FIGURE 22. Actual Coverage Probabilities of Nominal 68%
Raw-Score Intervals with k = 17

FIGURE 23. Actual Coverage Probabilities of Nominal 68% Scale-Score Intervals
Using Conditional Scale-Score SEMs with k = 17

FIGURE 24. Actual Coverage Probabilities of Nominal 68% Scale-Score Intervals
Using Overall Scale-Score SEMs with k = 17

FIGURE 25. Actual Coverage Probabilities of Nominal 50%
Raw-Score Intervals with k = 17

FIGURE 26. Actual Coverage Probabilities of Nominal 50% Scale-Score Intervals
Using Conditional Scale-Score SEMs with k = 17

FIGURE 27. Actual Coverage Probabilities of Nominal 50% Scale-Score Intervals
Using Overall Scale-Score SEMs with k = 17